THE ANNALS
of
STATISTICS


AN OFFICIAL JOURNAL OF 
THE INSTITUTE OF MATHEMATICAL STATISTICS 


The 1984 Wald Memorial Lectures 
Boundary crossing probabilities and statistical applications .......... DAVID SIEGMUND 


Articles 


Maximum likelihood estimators and likelihood ratio criteria in multivariate components
of variance .......... BLAIR M. ANDERSON, T. W. ANDERSON, AND INGRAM OLKIN
Asymptotic theory for common principal component analysis ....... BERNARD N. FLURY
Confidence sets for a multivariate distribution ........... R. BERAN AND P. W. MILLAR
Improved confidence sets for the coefficients of a linear model with spherically symmetric
errors .......... JIUNN TZON HWANG AND JEESEN CHEN
Robust Bayes and empirical Bayes analysis with ε-contaminated priors
JAMES BERGER AND L. MARK BERLINER 
Characterization of externally Bayesian pooling operators 
CHRISTIAN GENEST, KEVIN J. MCCONWAY AND MARK J. SCHERVISH 
A Bayes procedure for the identification of univariate time series models ...D. S. POSKITT 
Large-sample properties of parameter estimates for strongly dependent stationary
Gaussian time series .......... ROBERT FOX AND MURAD S. TAQQU
Limit theory for the sample covariance and correlation functions of moving
averages .......... RICHARD DAVIS AND SIDNEY RESNICK
Statistical estimation of the parameters of a moving source from array data 
SHEAN-TSONG CHIU
Estimation for a semimartingale regression model using the method of sieves 
IAN W. MCKEAGUE
The dimensionality reduction principle for generalized additive models 
CHARLES J. STONE 
Time sequential estimation of the exponential mean under random withdrawals 
JOSEPH C. GARDINER, V. SUSARLA AND JOHN VAN RYZIN 
Empirical processes associated with V-statistics and a class of estimators under
random censoring .......... MICHAEL G. AKRITAS
Conditional empirical processes .......... WINFRIED STUTE
Large deviations of estimators .......... A. D. M. KESTER AND W. C. M. KALLENBERG 
The statistical information contained in additional observations ........ ENNO MAMMEN 
An approach to upper bound problems for risks of generalized least squares
estimators .......... YASUYUKI TOYOOKA AND TAKEAKI KARIYA
Intergroup diversity and concordance for ranking data: An approach via metrics
for permutations .......... PAUL D. FEIGIN AND MAYER ALVO


Testing for normality in arbitrary dimension .......... SÁNDOR CSÖRGŐ
Minimax variance M-estimators of location in Kolmogorov neighbourhoods
DOUG WIENS
On optimal decision rules for signs of parameters 
YOSEF HOCHBERG AND MARC E. POSNER 


Orthogonality of factorial effects .............. CHAND K. CHAUHAN AND A. M. DEAN
Short Communications 
An Efron-Stein inequality for nonsymmetric statistics ............ J. MICHAEL STEELE


Chi-square goodness-of-fit tests for randomly censored data 
M. G. HABIB AND D. R. THOMAS 
Some asymptotic properties of kernel estimators of a density function in case of
censored data .......... JAN MIELNICZUK
A large deviation result for signed linear rank statistics under the symmetry
hypothesis .......... TIEE-JIAN WU


Vol. 14, No. 2— June, 1986 




THE INSTITUTE OF MATHEMATICAL STATISTICS 


(Organized September 12, 1935) 


The purpose of the Institute of Mathematical Statistics is to encourage the
development, dissemination, and application of mathematical statistics. 


OFFICERS AND EDITORS 
President: 


Paul Meier, Department of Statistics, University of Chicago, 5734 University Avenue, 
Chicago, Illinois 60637 


President-Elect:
Ronald Pyke, Department of Mathematics GN-50, University of Washington, Seattle, 
Washington 98195 

Past President: 


Oscar Kempthorne, Department of Statistics, 111B Snedecor Hall, Iowa State University, 
Ames, Iowa 50011 

Executive Secretary:
Francisco J. Samaniego, Division of Statistics, 469 Kerr Hall, University of California,
Davis, California 95616 

Treasurer: 
Nicholas P. Jewell, Group in Biostatistics, University of California, Berkeley. Please send 
correspondence to: IMS Business Office, 3401 Investment Boulevard #7, Hayward,
California 94545 

Program Secretary: 

Richard A. Johnson, Department of Statistics, University of Wisconsin, 1210 West Dayton 
Street, Madison, Wisconsin 53706 

Editor: The Annals of Statistics 
Willem R. van Zwet, Department of Mathematics, University of Leiden, P.O. Box 9512, 
2300 RA Leiden, The Netherlands 

Editor: The Annals of Probability
Thomas M. Liggett, Department of Mathematics, University of California, Los Angeles,
California 90024 

Executive Editor: Statistical Science
Morris H. DeGroot, Department of Statistics, Carnegie-Mellon University, Pittsburgh, 
Pennsylvania 15213

Editor: The IMS Bulletin 
William C. Guenther, Department of Statistics, University of Wyoming, Laramie. Please
send correspondence to: Box 3332 University Station, Laramie, Wyoming 82071 

Editor: The IMS Lecture Notes Monograph Series 
Shanti S. Gupta, Department of Statistics, Purdue University, West Lafayette, Indiana 
47907 


Managing Editor: 
Paul Shaman, Department of Statistics, University of Pennsylvania, Philadelphia,
Pennsylvania 19104 


Journals. The scientific journals of the Institute are The Annals of Statistics, The Annals of
Probability, and Statistical Science. The news organ of the Institute is The Institute of
Mathematical Statistics Bulletin.


Individual, Institutional, and Corporate Memberships. All individual members receive
Statistical Science and The IMS Bulletin for a basic annual dues rate of $30. Individual members
may elect to receive one Annals for an additional $10 or both Annals for an additional $20. Dues
allocations to each journal are set by Council resolution. Of the total dues paid, $8 is allocated to
The IMS Bulletin and the remaining amount is allocated equally among the scientific journals
received. Memberships are available at a reduced rate (40% of regular rates) for full-time students, 
permanent residents of developing countries, and retired members. Retired members may also 
elect to receive the Bulletin only for $10. Institutional memberships are available to nonprofit 
organizations at $225 per year and corporate memberships are available to other organizations at 
$500 per year. Institutional and corporate memberships include two multiple-readership copies of 
all IMS journals in addition to other benefits specified for each category (details available from 
the IMS Business Office). 


Individual and General Subscriptions. Subscriptions are available on a calendar-year basis. 
For 1986, all subscriptions to one or both Annals automatically include one subscription to 
Statistical Science. Individual subscriptions are for the personal use of the subscriber and must be
in the name of, paid directly by, and mailed to an individual. Individual subscriptions for 1986 are
available to both Annals and Statistical Science ($79), one Annals and Statistical Science ($52),
Statistical Science only ($25), and The IMS Bulletin ($15). General subscriptions are for libraries, 
institutions, and any multiple-readership use. General subscriptions for 1986 are available to both 
Annals and Statistical Science ($155), The Annals of Statistics and Statistical Science ($85), The 
Annals of Probability and Statistical Science ($80), Statistical Science only ($40), and The IMS
Bulletin ($20). Air mail rates for overseas delivery of general subscriptions are $40 per title. 


Correspondence. Mail to IMS should be sent to the IMS Business Office (membership, subscrip- 
tions, claims, copyright permissions, advertising, back issues), the Editor of the appropriate 
journal (submissions, editorial content), or the Managing Editor (production). 


The Annals of Statistics (ISSN 0090-5364), Volume 14, Number 2, June 1986. Published in 
March, June, September, and December by the Institute of Mathematical Statistics, 3401 
Investment Boulevard #7, Hayward, California 94545. Second-class postage paid at Hayward, 
California and at additional mailing offices. Postmaster: Send address changes to The Annals 
of Statistics, Institute of Mathematical Statistics, 3401 Investment Boulevard #7, Hayward,
California 94545. 
Copyright © 1986 by the Institute of Mathematical Statistics 
Printed in the United States of America 




EDITORIAL STAFF 
EDITOR 
WILLEM R. VAN ZWET
ASSOCIATE EDITORS 

JAMES O. BERGER DAVID F. FINDLEY BRUCE G. LINDSAY
PETER J. BICKEL RICHARD D. GILL DAVID POLLARD
LAWRENCE D. BROWN PIET GROENEBOOM DAVID O. SIEGMUND
RAYMOND J. CARROLL PETER HALL A. F. M. SMITH
CHING-SHUI CHENG R. Z. HASMINSKII TERRY SPEED
DENNIS D. COX NIELS KEIDING JON A. WELLNER
MORRIS L. EATON STEFFEN L. LAURITZEN MICHAEL WOODROOFE

EDITORIAL ASSISTANT 


LUCIE W. VAN ZWET 


MANAGING EDITOR
PAUL SHAMAN
EDITORIAL ASSISTANT
TAMMY MORROW

PAST EDITORS
THE ANNALS OF MATHEMATICAL STATISTICS 
H. C. CARVER, 1930-1938 WILLIAM KRUSKAL, 1958-1961 
S. S. WILKS, 1938-1949 J. L. HODGES, JR., 1961-1964
T. W. ANDERSON, 1950-1952 D. L. BURKHOLDER, 1964-1967
E. L. LEHMANN, 1953-1955 Z. W. BIRNBAUM, 1967-1970
T. E. HARRIS, 1955-1958 INGRAM OLKIN, 1970-1972 
THE ANNALS OF STATISTICS THE ANNALS OF PROBABILITY 
INGRAM OLKIN, 1972-1973 RONALD PYKE, 1972-1975 
I. R. SAVAGE, 1974-1976 PATRICK BILLINGSLEY, 1976-1978
RUPERT G. MILLER, JR., 1977-1979 R. M. DUDLEY, 1979-1981 
DAVID V. HINKLEY, 1980-1982 HARRY KESTEN, 1982-1984 


MICHAEL D. PERLMAN, 1983-1985


EDITORIAL POLICY 


The main purpose of The Annals of Statistics and The Annals of Probability is to publish
significant contributions to the theory of statistics and probability and their applications. The 
emphasis is on importance and interest; formal novelty and mathematical correctness alone are 
not sufficient for publication. Especially appropriate are authoritative expository papers and 
surveys of areas in vigorous development. Because statistics is an evolving discipline, the Editors 
of The Annals of Statistics take a broad view of its domain and welcome papers in interface areas.
Contributors to The Annals of Statistics should review the editorial in the January 1980 issue. All 
papers are refereed. 


NOTICE


Manuscripts submitted for The Annals of Statistics should be sent to the Editor at the
following address: 


Willem R. van Zwet 

Department of Mathematics
University of Leiden 

P. O. Box 9512 

2300 RA Leiden 

The Netherlands 


IMS CORPORATE MEMBERS 


THE AEROSPACE CORPORATION INTERNATIONAL BUSINESS MACHINES CORP.
Los Angeles, California Thomas J. Watson Research Center 
Yorktown Heights, New York 


SPRINGER-VERLAG NEW YORK INCORPORATED 
New York, New York 


BELL COMMUNICATIONS RESEARCH UNION OIL COMPANY OF CALIFORNIA 
Morristown, New Jersey Brea, California


AT & T BELL LABORATORIES 
Murray Hill, New Jersey 


GENERAL MOTORS CORPORATION 
Research Laboratories 
Warren, Michigan 


IMS INSTITUTIONAL MEMBERS 


ARIZONA STATE UNIVERSITY MICHIGAN STATE UNIVERSITY 
Dept of Mathematics Dept of Statistics and Probability 
Tempe, Arizona East Lansing, Michigan 
AUSTRALIAN NATIONAL UNIVERSITY NATIONAL SECURITY AGENCY 
Canberra, ACT, Australia Fort George G. Meade, Maryland
BOWLING GREEN STATE UNIVERSITY NEW MEXICO STATE UNIVERSITY
Dept of Mathematics and Statistics Dept of Mathematical Sciences


Bowling Green, Ohio


CALIFORNIA STATE UNIVERSITY 

AT FULLERTON 
Depts of Math and Management Science 
Fullerton, California 


Las Cruces, New Mexico


NORTH CAROLINA STATE UNIVERSITY 
Dept of Statistics 
Raleigh, North Carolina 


CASE WESTERN RESERVE UNIVERSITY NORTHERN ILLINOIS UNIVERSITY
Dept of Mathematics and Statistics Dept of Mathematical Sciences
Cleveland, Ohio DeKalb, Illinois
CENTERS FOR DISEASE CONTROL NORTHWESTERN UNIVERSITY
Atlanta, Georgia Dept of Mathematics


Evanston, Illinois 
CORNELL UNIVERSITY 
Dept of Mathematics OHIO STATE UNIVERSITY 
Ithaca, New York Dept of Statistics 
Columbus, Ohio 
FLORIDA STATE UNIVERSITY 
Dept of Statistics OREGON STATE UNIVERSITY 
Tallahassee, Florida Dept of Statistics 
Corvallis, Oregon 
GEORGE WASHINGTON UNIVERSITY 
Dept of Statistics PENNSYLVANIA STATE UNIVERSITY 
Washington, DC Dept of Statistics 
INDIANA UNIVERSITY University Park, Pennsylvania 
Dept of Mathematics PRINCETON UNIVERSITY LIBRARY 
Bloomington, Indiana Princeton, New Jersey 


IOWA STATE UNIVERSITY 


Dept of Stat and Statistical Lab PURDUE UNIVERSITY LIBRARY 


West Lafayette, Indiana 


Ames, Iowa

JOHNS HOPKINS UNIVERSITY QUEEN’S UNIVERSITY 

Dept of Biostatistics Dept of Mathematics and Statistics 
Baltimore, Maryland Kingston, Ontario, Canada
KANSAS STATE UNIVERSITY RICE UNIVERSITY 

Dept of Statistics Dept of Mathematical Sciences
Manhattan, Kansas Houston, Texas 

MARA INSTITUTE OF TECHNOLOGY THE ROCKEFELLER UNIVERSITY LIBRARY 
Selangor, West Malaysia New York, New York 
MASSACHUSETTS INSTITUTE OF TECHNOLOGY SIMON FRASER UNIVERSITY 

Dept of Mathematics Dept of Mathematics and Statistics 
Cambridge, Massachusetts Burnaby, British Columbia, Canada
MIAMI UNIVERSITY SOUTHERN ILLINOIS UNIVERSITY 
Dept of Mathematics and Statistics Dept of Math, Stat, and Comp Sci 


Oxford, Ohio Edwardsville, Illinois 


SOUTHERN METHODIST UNIVERSITY 
Dept of Statistics 
Dallas, Texas 


STANFORD UNIVERSITY 
Dept of Statistics 
Stanford, California 


TEMPLE UNIVERSITY 
Dept of Mathematics 
Philadelphia, Pennsylvania 


TEXAS TECH UNIVERSITY 
Dept of Mathematics 
Lubbock, Texas 


UNIVERSITY OF ARIZONA 
Dept of Mathematics 
Tucson, Arizona


UNIVERSITY OF BRITISH COLUMBIA 
Dept of Statistics 
Vancouver, British Columbia, Canada 


UNIVERSITY OF CALGARY 
Division of Statistics 
Calgary, Alberta, Canada 


UNIVERSITY OF CALIFORNIA 
Dept of Statistics 
Berkeley, California 


UNIVERSITY OF CALIFORNIA 
Division of Statistics 
Davis, California 


UNIVERSITY OF GUELPH 
Dept of Mathematics and Statistics 
Guelph, Ontario, Canada 


UNIVERSITY OF ILLINOIS 
Department of Statistics 
Urbana, Illinois


UNIVERSITY OF ILLINOIS AT CHICAGO 
Dept of Math, Stat, and Comp Sci 
Chicago, Illinois


UNIVERSITY OF IOWA 
Dept of Statistics and Actuarial Sci
Iowa City, Iowa


UNIVERSITY OF MANITOBA 
Dept of Statistics 
Winnipeg, Manitoba, Canada 


UNIVERSITY OF MARYLAND 
Dept of Mathematics 
College Park, Maryland 


UNIVERSITY OF MASSACHUSETTS 
Dept of Mathematics and Statistics
Amherst, Massachusetts 


UNIVERSITY OF MICHIGAN 
Dept of Statistics 
Ann Arbor, Michigan 


UNIVERSITY OF MINNESOTA 
School of Statistics
Minneapolis, Minnesota 


UNIVERSITY OF MISSOURI AT COLUMBIA 
Dept of Statistics 
Columbia, Missouri 


UNIVERSITY OF MONTREAL 
Dept of Mathematics 
Montreal, Quebec, Canada


UNIVERSITY OF NEBRASKA 
Dept of Mathematics and Statistics 
Lincoln, Nebraska 


UNIVERSITY OF NEW MEXICO 
Dept of Mathematics and Statistics 
Albuquerque, New Mexico 


UNIVERSITY OF NORTH CAROLINA 
Dept of Statistics 
Chapel Hill, North Carolina


UNIVERSITY OF OREGON 
Dept of Mathematics 
Eugene, Oregon 


UNIVERSITY OF OTTAWA 
Dept of Mathematics 
Ottawa, Ontario, Canada 


UNIVERSITY OF SOUTH CAROLINA 
Dept of Statistics 
Columbia, South Carolina 


UNIVERSITY OF STOCKHOLM 
Inst of Actuarial Math and Math Stat 
Stockholm, Sweden 


UNIVERSITY OF TEXAS AT AUSTIN 
Dept of Mathematics
Austin, Texas 


UNIVERSITY OF TEXAS AT SAN ANTONIO 
Div of Math, Comp Sci, and Systems Design 
San Antonio, Texas 


UNIVERSITY OF VICTORIA 
Dept of Mathematics 
Victoria, British Columbia, Canada 


UNIVERSITY OF VIRGINIA 
Dept of Mathematics 
Charlottesville, Virginia 


UNIVERSITY OF WASHINGTON 
Depts of Statistics and Mathematics
Seattle, Washington 


UNIVERSITY OF WATERLOO 
Dept of Statistics and Actuarial Sci 
Waterloo, Ontario, Canada


UNIVERSITY OF WISCONSIN AT MILWAUKEE 


Dept of Mathematical Sciences 
Milwaukee, Wisconsin 


VIRGINIA COMMONWEALTH UNIVERSITY 
Dept of Mathematical Sciences 
Richmond, Virginia 


VIRGINIA POLYTECHNIC INSTITUTE 
AND STATE UNIVERSITY 

Dept of Statistics

Blacksburg, Virginia 


WAYNE STATE UNIVERSITY 
Dept of Mathematics 
Detroit, Michigan 


YORK UNIVERSITY 
Dept of Mathematics 
Downsview, Ontario, Canada 


THE ANNALS OF STATISTICS 


INSTRUCTIONS FOR AUTHORS 


Submission of Papers. Papers to be submitted for 
publication should be sent to the Editor of The Annals 
of Statistics. (For current address, see the latest issue 
of the Annals.) Four copies should be submitted on 
paper that will take ink corrections. The manuscript 
will not normally be returned to the author; when
expressly requested by the author, one copy of the 
manuscript will be returned. All manuscripts should be 
accompanied by a cover letter. 


Preparation of Manuscripts. Manuscripts should be 
typewritten, entirely double-spaced, including refer- 
ences, with wide margins at sides, top and bottom. All 
copies must be completely legible. When technical 
reports are submitted, all extraneous sheets and covers 
must be removed. Typists should check an issue of the 
Annals for style. 


Submission of Reference Papers. Four copies of 
unpublished or not easily available papers cited in the 
manuscript should be submitted with the manuscript. 


Title. The title should be descriptive and as concise as 
is feasible, i.e., it should indicate the topic of the paper
as clearly as possible, but every word in it should be
pertinent. 


Abbreviated Title. An abbreviated title to be used as 
a running head is also required. This should normally 
not exceed 35 characters. For example, an article with
the title “The Curvature of a Statistical Model, with 
Applications to Large-Sample Likelihood Methods,” 
could have the running head, “Curvature of Statistical 
Model” or possibly “Asymptotics of Likelihood Meth- 
ods,” depending on the emphasis to be conveyed. 


Affiliation. Indicate your present institutional affiliation
as you would like it to appear.


Summary. Each manuscript is required to contain a 
summary, clearly separated from the rest of the paper, 
which will be printed immediately after the title. Its 
main purpose is to inform the reader quickly of the 
nature and results of the paper; it may also be used as
an aid in retrieving information. The length of a
summary will clearly depend on the length and dif- 
ficulty of the paper, but in general it should not exceed 
150 words. Formulas should be used as sparingly as 
possible within the summary. The summary should 
not make reference to results or formulas in the body 
of the paper—it should be self-contained. 


Footnotes. Footnotes should not be used, except as 
described under Title Page Footnotes below. Such 
information should be included within the text. 


Title Page Footnotes. Included as a footnote on page
1 should be the headings:


American Mathematical Society 1980 subject
classifications. Primary ___; secondary ___.

Key words and phrases.


The classification numbers representing the primary
and secondary subjects of the article may be
found with instructions for its use in the Mathematical
Reviews Annual Subject Index-1980. The key words
and phrases should describe the subject matter of the
article; generally they should be taken from the body
of the paper.


Acknowledgment of support. Grants and contracts
should also be included in this footnote.


Identification of Symbols. Manuscripts for publication
should be clearly prepared to insure that all
symbols are properly identified. Distinguish between
“oh” and “zero”; “ell” and “one”; “epsilon” and “element
of”; “summation” and “capital sigma,” etc. Indicate
also when special type is required (Greek, German,
script, boldface, etc.); unless indicated otherwise,
formula letters will be set in italics. Acronyms should
be introduced sparingly. Any handwritten symbols
should be clearly identified.


Figures and Tables. Figures, charts, and diagrams
should be prepared in a form suitable for photographic
reproduction and should be professionally drawn twice
the size they are to be printed. (These need not be
submitted until the paper has been accepted for
publication.) The printer does not improve upon the quality
of the figures submitted. Tables should be typed on
separate pages with accompanying footnotes
immediately below the table.


Formulas. Fractions in the text are preferably written
with the solidus or negative exponent; thus,
$(a + b)/(c + d)$ is preferred to $\frac{a + b}{c + d}$,
and $(2\pi)^{-1}$ or $1/(2\pi)$ to $\frac{1}{2\pi}$.
Also, $a^{b(c)}$ and $a_{b(c)}$ are preferred
to $a^{b_c}$ and $a_{b_c}$, respectively. Complicated exponentials
should be represented with the symbol exp. A fractional
exponent is preferable to a radical sign.


References. References should be typed double-spaced
and should follow the style:


Kiefer, J. C. (1976). Admissibility of conditional
confidence procedures. Ann. Statist. 4 836-865.


In textual material, the format “... Kiefer
(1976)...” should be used. Multiple references can be
distinguished as “... Kiefer (1976a)...”. Abbreviations
for journals should be taken from a current index issue
of Mathematical Reviews.


Addresses. The permanent address of each author
should be typed following the references.


Galley Proofs. The author will ordinarily receive galley
proofs. Corrected galley proofs should be sent to
AOS Redactery, Science Typographers, Inc., 15
Industrial Boulevard, Medford, NY 11763.


Correspondence. All correspondence with the Editor
must refer to the manuscript number of the paper. This
number will be on the card sent to the author
acknowledging receipt of the article.

The Annals of Statistics
1986, Vol. 14, No. 2, 361-404


THE 1984 WALD MEMORIAL LECTURES 


BOUNDARY CROSSING PROBABILITIES AND STATISTICAL 
APPLICATIONS¹


By DAVID SIEGMUND 
Stanford University 


This paper surveys recent results involving boundary crossing probabili- 
ties and related statistical applications. The first part is concerned with 
problems of sequential analysis, especially repeated significance tests and 
their application to sequential clinical trials involving survival data. The 
second part develops the probability theory motivated by the problems of 
Part 1. A method for computing first passage distributions of Brownian 
motion to linear boundaries is introduced and then modified to handle
problems in discrete time and those involving nonlinear boundaries.
The third part is concerned with fixed sample statistical problems, especially 
change-point problems, which involve boundary crossing probabilities. Exam- 
ples are given of problems for which the methods of Part 2 appear adequate 
and of problems which require new methods. 


0. Introduction. Let $X(t)$, $t = 1, 2, \ldots$ or $0 \le t < \infty$, be a stochastic
process and let $c(t)$ be constants. The general subject of this paper is approximate
computation of boundary crossing probabilities of the form


(0.1) $P\{X(t) \ge c(t) \text{ for some } m_0 \le t \le m\}$
(0.2) $P\{X(t) \ge c(t) \text{ for some } m_0 \le t < m \mid X(m) = \xi\}$


and statistical applications of the resulting approximations. 

The grandfather of all such problems in statistics is to determine the distribu- 
tion of the one sample Kolmogorov-Smirnov statistic, which is of the form (0.1) 
with X(t) the difference between the empirical and true distribution function of 
a random sample and c(t) identically constant. The limiting distribution of this 
statistic is of the form (0.2) with X(t) a Brownian motion process, c(t) identically
constant, and $\xi = 0$. The principal contemporary motivation for studying
such problems comes from sequential analysis, which is the context in which 
many of the results discussed below first arose. 
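
As a small illustration of a probability of the form (0.2), the sketch below evaluates the classical closed form for standard Brownian motion $W$ started at 0 with a constant boundary: $P\{W(t) \ge c \text{ for some } 0 \le t \le m \mid W(m) = \xi\} = \exp\{-2c(c - \xi)/m\}$ for $c > 0$ and $\xi < c$. The function name and the choice of $m_0 = 0$ are assumptions made for this example; they are not notation from the paper.

```python
import math


def bridge_crossing_prob(c: float, xi: float = 0.0, m: float = 1.0) -> float:
    """P{W(t) >= c for some 0 <= t <= m | W(m) = xi}, W standard Brownian motion.

    Hypothetical helper for illustration; assumes W(0) = 0 and a constant level c.
    """
    if c <= 0.0 or xi >= c:
        # The path starts (W(0) = 0) or ends (W(m) = xi) at or above the level.
        return 1.0
    return math.exp(-2.0 * c * (c - xi) / m)


if __name__ == "__main__":
    # With xi = 0 and m = 1 this is the one-sided Brownian bridge case.
    for level in (0.5, 1.0, 1.5):
        print(level, bridge_crossing_prob(level))
```

With $\xi = 0$ and $m = 1$ the value reduces to $\exp(-2c^2)$, the limiting tail of the one-sided Kolmogorov-Smirnov statistic.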

The paper is divided into three parts. The first is concerned with a class of
problems in sequential analysis which lead naturally to problems of the form 
(0.1) and (0.2). The discussion in Part 1 is restricted to statistical issues. We shall 
in effect assume that we can compute without difficulty various boundary 
crossing probabilities that arise. However, these problems motivate the second 


Received January 1985; revised July 1985.

¹This work was supported by Office of Naval Research Contract N00014-77-C-0306 (NR-042-373)
and National Science Foundation Grant MCS80-24649.

AMS 1980 subject classification. Primary 62L10.

Key words and phrases. Sequential test, change point, Brownian motion, large deviations. 




part of the paper, which is concerned with the mathematical problem of 
approximation of boundary crossing probabilities. The third part discusses a 
number of nonsequential statistical problems which also lead to boundary 
crossing probabilities, some of which are essentially already solved in Part 2, and 
some of which require the development of new methods. Particular attention is 
given to so-called “change-point” problems. The results here are in some respects 
less complete and outline a program for future research. 

Because the subject of boundary crossing probabilities is quite technical, to
convey the main ideas the following discussion is frequently heuristic and 
restricted to special cases. References are given to mathematically rigorous 
treatments. I have written a monograph on sequential analysis (Siegmund, 
1985b), which describes in substantially more detail most of the results of Parts 1 
and 2. 


1. Sequential analysis. The primary impetus for the development of 
sequential analysis during the 1940’s was a desire for more efficient methods of 
sampling inspection. Recent developments have been motivated at least in part 
by ethical considerations in the design of clinical trials.


1.1. Repeated significance tests for normal data. We shall consider in detail 
the following very simple model of a clinical trial. In order to compare two 
treatments, A and B, patients arrive sequentially and are paired, with one 
patient in each pair receiving treatment A and the other treatment B. Let $a_i$ ($b_i$)
denote the (immediate) response of the patient in the $i$th pair who receives
treatment A (B), and let $x_i = a_i - b_i$. Assume that $x_1, x_2, \ldots$ are independent
and normally distributed random variables with mean $\mu$ and known variance,
which without loss of generality can be assumed equal to 1. Our primary goal is
to test the hypothesis of no treatment effect, $H_0: \mu = 0$, against the alternative
$H_1: \mu \ne 0$. Of course the standard fixed sample size test (at level 0.05) based on a
sample of size $n$ is to compute $S_n = x_1 + \cdots + x_n$ and reject $H_0$ if $|S_n| \ge b' n^{1/2}$
($b' = 1.96$).
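
The power of this fixed sample test follows directly from $S_n \sim N(n\mu, n)$: it equals $\Phi(-b' + \mu n^{1/2}) + \Phi(-b' - \mu n^{1/2})$. A minimal sketch of that calculation is given here; the function names are chosen for this example only.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal distribution function, computed from math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fixed_sample_power(mu: float, n: int, b_prime: float = 1.96) -> float:
    """Power of the two-sided test that rejects H0 when |S_n| >= b' * sqrt(n)."""
    shift = mu * math.sqrt(n)  # mean of S_n / sqrt(n) under the alternative
    return std_normal_cdf(-b_prime + shift) + std_normal_cdf(-b_prime - shift)


if __name__ == "__main__":
    # mu = 0 recovers the significance level, approximately 0.05.
    for mu in (0.0, 0.994, 1.311, 2.071):
        print(mu, round(fixed_sample_power(mu, n=5), 3))
```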

If $\mu$ is considerably different from 0, indicating that one of the two treatments
is considerably superior to the other, it is desirable to ascertain this fact with a
minimum amount of experimentation, so that all future patients can receive the
(apparently) superior treatment. On the other hand if $\mu$ is about equal to 0, there
is no ethical mandate (although there may be a financial one) to stop sampling as
soon as possible. A sequential test designed to stop sampling as soon as it is
apparent that $H_1$ is true while behaving like a fixed sample test if $H_0$ appears to
be true is the so-called repeated significance test, defined as follows (cf. Armitage,
1975).

Given $m_0$, $m$, and $b > 0$, define the stopping rule


(1.1) $T = \inf\{n: n \ge m_0,\ |S_n| \ge b n^{1/2}\}$.


Stop sampling at $\min(T, m)$ and reject $H_0$ if and only if $T \le m$. The power of




this test is

(1.2) $P_\mu\{T \le m\} = P_\mu\left( \bigcup_{n=m_0}^{m} \{|S_n| \ge b n^{1/2}\} \right)$.

Its expected sample size is

(1.3) $E_\mu[\min(T, m)]$,


which we anticipate will be small when $|\mu| \gg 0$ and about equal to $m$ when
$\mu = 0$.
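
The quantities (1.2) and (1.3) can be checked by brute-force simulation. The sketch below is a plain Monte Carlo estimate, not the approximation method developed in Part 2 of the paper; the function name, the number of replications, and the seed are assumptions made for this example.

```python
import math
import random


def simulate_repeated_significance_test(mu, b, m0, m, reps=20000, seed=0):
    """Monte Carlo estimates of the power (1.2) and expected sample size (1.3).

    Observations are N(mu, 1); sampling stops at min(T, m) with T as in (1.1).
    """
    rng = random.Random(seed)
    rejections = 0
    total_observations = 0
    for _ in range(reps):
        s = 0.0
        for n in range(1, m + 1):
            s += rng.gauss(mu, 1.0)
            if n >= m0 and abs(s) >= b * math.sqrt(n):
                rejections += 1  # boundary crossed: T = n <= m, reject H0
                break
        total_observations += n  # n equals min(T, m) when the loop ends
    return rejections / reps, total_observations / reps


if __name__ == "__main__":
    # Design used in Table 1: b = 2.413, m0 = 1, m = 5.
    for mu in (0.0, 0.994, 2.071):
        power, ess = simulate_repeated_significance_test(mu, 2.413, 1, 5)
        print(mu, round(power, 3), round(ess, 2))
```

Estimates from a run like this carry simulation error of order $reps^{-1/2}$, so they should agree with tabulated values only to about two decimal places.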


REMARK. Note that the stopping rule (1.1) can be written
$T = \inf\{n: n \ge m_0,\ S_n^2/2n \ge b^2/2\}$,
or
(1.4) $T = \inf\{n: n \ge m_0,\ \Lambda_n \ge a\}$,


where $\Lambda_n$ is the log likelihood ratio statistic for testing $\mu = 0$ against $\mu \ne 0$, and
$a = b^2/2$. This observation is very helpful in adapting the results developed here
to different situations. For example, if we drop the hypothesis that the variance
of the $x$'s is known, (1.4) becomes


$T = \inf\{n: n \ge m_0,\ (n/2)\log(1 + \bar{x}_n^2/v_n^2) \ge a\}$,


where $\bar{x}_n = n^{-1}\sum_1^n x_i$ and $v_n^2 = n^{-1}\sum_1^n (x_i - \bar{x}_n)^2$. Much of the theory developed
for tests based on (1.1) in normal families can be adapted to tests defined by (1.4)
in multiparameter exponential families (Woodroofe, 1978; Lalley, 1983; Hu,
1985).
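
The two forms of the log likelihood ratio statistic in this remark can be verified numerically by maximizing the normal log likelihood in closed form under each hypothesis. The following sketch does so for a small made-up sample; the data values and function names are illustrative assumptions.

```python
import math


def normal_loglik(xs, mu, sigma2):
    """Log likelihood of an i.i.d. N(mu, sigma2) sample."""
    n = len(xs)
    rss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - rss / (2.0 * sigma2)


def llr_known_variance(xs):
    """Lambda_n = S_n^2 / (2n) when the variance is known to equal 1."""
    s = sum(xs)
    return s * s / (2.0 * len(xs))


def llr_unknown_variance(xs):
    """Lambda_n = (n/2) log(1 + xbar^2 / v^2) when the variance is unknown."""
    n = len(xs)
    xbar = sum(xs) / n
    v2 = sum((x - xbar) ** 2 for x in xs) / n
    return 0.5 * n * math.log(1.0 + xbar * xbar / v2)


if __name__ == "__main__":
    xs = [0.3, -1.2, 2.1, 0.8, 1.5]  # made-up data for illustration
    n = len(xs)
    xbar = sum(xs) / n
    v2 = sum((x - xbar) ** 2 for x in xs) / n
    s0 = sum(x * x for x in xs) / n  # variance MLE under mu = 0
    print(llr_known_variance(xs),
          normal_loglik(xs, xbar, 1.0) - normal_loglik(xs, 0.0, 1.0))
    print(llr_unknown_variance(xs),
          normal_loglik(xs, xbar, v2) - normal_loglik(xs, 0.0, s0))
```

Each printed pair should agree up to rounding error, since $\sum x_i^2 = n(v_n^2 + \bar{x}_n^2)$.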


In large multicenter clinical trials it does not appear feasible to monitor the
accumulating data continuously, so it is convenient to consider also group
repeated significance tests, in which we suppose that each observation $x_i$ is
actually the sum of several, say $k$, observations which constitute the $i$th group.
This does not affect the theoretical developments that follow, since $x_i$ is still
normally distributed (and may be approximately normally distributed even if
the individual observations are not), but it does mean that a small value of the
parameter $m$ can represent a large sample size if the group size $k$ is large. Also,
for a group sequential test with group size $k$ the real expected sample size is $k$
times the quantity in (1.3). As we shall see below, the performance of a group
sequential test is fairly insensitive to the group size, provided the ratio $m/m_0$ is
kept fixed and the stopping boundary is adjusted so that one always compares
tests of a given significance level. Of course, this remark is untrue in extreme
cases, e.g., $m = 2$ and $k$ large. See also Pocock (1977) and Siegmund (1985b).

Before presenting a numerical example, it is convenient to introduce a modifi- 
cation of the repeated significance test defined above, which does have an 
important impact on its operating characteristics. As Table 1 illustrates, a 
repeated significance test can have a much smaller expected sample size for large 
$|\mu|$ than a fixed sample test of sample size $m$, but the price it pays is a


TABLE 1
Repeated significance test: $b = 2.413$, $m_0 = 1$, $m = 5$,
and $\alpha = 0.049$ (0.05)

                                                Power of fixed
$\mu$     Power (1.2)    $E_\mu(T \wedge m)$    sample test
2.071     0.99 (0.99)    1.93 (2.05)            1.00
1.759     0.95 (0.95)    2.43 (2.53)            0.98
1.592     0.91 (0.90)    2.76 (2.84)            0.96
1.311     0.76 (0.75)    3.35 (3.41)            0.83
0.994     0.52 (0.50)    4.02 (4.04)            0.60

Parenthetical entries from Pocock (1977).


considerable loss of power. To recapture most of this lost power at a relatively 
small increase in the expected sample size, consider the following family of tests 
which interpolate between fixed sample tests and repeated significance tests. 
Given $0 < c < b$, let $T$ be defined by (1.1). Stop sampling at $\min(T, m)$ and
reject $H_0$ if either $T \le m$ or $T > m$ and $|S_m| \ge c m^{1/2}$. The power of this test is


(1.5) $P_\mu\{T \le m\} + P_\mu\{T > m,\ |S_m| \ge c m^{1/2}\}$
      $= P_\mu\{|S_m| \ge c m^{1/2}\} + P_\mu\{T \le m,\ |S_m| < c m^{1/2}\}$.

Of course the value of b must now be somewhat larger than previously if the 
overall significance level is to be unchanged, but by taking c only slightly larger 
than the rejection level of a fixed sample test, one makes the power essentially 
equal to the first term on the right-hand side of (1.5), which in this case is about 
the same as for the fixed sample test. See Figure 1 and Table 2 below. This 
modification of the repeated significance test was suggested independently, with 
varying motivation, by Haybittle (1971), Peto et al. (1976), and Siegmund (1978). 
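
The modified test can be simulated in the same way as the basic one. The sketch below is a plain Monte Carlo estimate of the power (1.5), of $P_\mu\{T \le m\}$, and of the expected sample size; it does not reproduce the approximations used for the tables in the paper, and the function name, replication count, and seed are assumptions made for this example.

```python
import math
import random


def simulate_modified_test(mu, b, c, m0, m, reps=20000, seed=0):
    """Monte Carlo estimates for the modified repeated significance test.

    Returns (power as in (1.5), P{T <= m}, E[min(T, m)]) for N(mu, 1) data.
    """
    rng = random.Random(seed)
    early_rejections = 0  # boundary b * sqrt(n) crossed at some n <= m
    final_rejections = 0  # T > m but |S_m| >= c * sqrt(m)
    total_observations = 0
    for _ in range(reps):
        s = 0.0
        crossed = False
        for n in range(1, m + 1):
            s += rng.gauss(mu, 1.0)
            if n >= m0 and abs(s) >= b * math.sqrt(n):
                crossed = True
                break
        total_observations += n  # n equals min(T, m) when the loop ends
        if crossed:
            early_rejections += 1
        elif abs(s) >= c * math.sqrt(m):
            final_rejections += 1
    power = (early_rejections + final_rejections) / reps
    return power, early_rejections / reps, total_observations / reps


if __name__ == "__main__":
    # Design used in Table 2: b = 2.7, c = 2.04, m0 = 1, m = 5.
    for mu in (0.0, 0.994, 2.071):
        print(mu, simulate_modified_test(mu, 2.7, 2.04, 1, 5))
```

Setting $\mu = 0$ gives an estimate of the overall significance level, which shows how much $b$ must be raised once the final comparison with $c m^{1/2}$ is added.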

Tables 1-3 contain numerical examples. The power function has been
approximated according to the suggestions in Part 2 of this paper. For comparison,
results obtained by Pocock (1977) by iterative numerical integration are included 
in parentheses, when available. Approximations to the expected sample size are 
computed according to the suggestion of Siegmund (1985b, (4.42)), which is not 
discussed here. Those cells which contain an asterisk are combinations of $b$, $m$,
and $\mu$ for which the approximation to the expected sample size is poor. Table 1 is
concerned with a repeated significance test having power function given by (1.2).
It is easy to see that there is a substantial savings in the expected sample size
when $|\mu| \gg 0$ compared to a fixed sample test taking $m$ observations. To
document the loss of power of the repeated significance test, the power of a fixed 
sample test taking m observations is also included in the table. Table 2 is 
concerned with a modified test having power function given by (1.5). Now there 
is essentially no loss of power, but still a quite considerable savings in the 
expected sample size. In order to compare Table 3 with Table 2, one should think 
of Table 2 as defining a group sequential test with k = 10 observations per 
group. Then the values given for u in the two tables are comparable (i.e., a value 


BOUNDARY CROSSING PROBABILITIES AND STATISTICAL APPLICATIONS 





> 
i 
| ` 
——— Stop and Reject Ho 
Stop and Do Not Reject Ho 
FIG. | 
TABLE 2 
Modified repeated significance test. 6 = 2.7, ¢ = 2.04, 
my 1, m = 5, and a = 0.0504 
m Power (1.5) ET ^Am) PT sm) 
2.071 1.00 2.26 0.98 
1.759 0.97 2.83 0.91 
1.592 0.94 3.18 0.84 
= 1.311 0,82 3.76 ().66 
j 0.994 0.59 * 0.40 
TABLE 3 


Modified repeated significance test: 6 = 2.91, c = 2.05, my 
= 10, m = 50, and a = 0.0503 


m Power ELT ^A m) P{T <m) 
0.655 1.00 19 0.97 
0.556 0.97 25 0.89 
0.503 0.94 29 0.81 
0.415 0.82 35 0.62 
0.314 0.59 * 0.36 


365 


366 D SIEGMUNI) 


in Table 3 equals the corresponding value in Table 2 divided by k!”* = 3.16) and 
the expected sample sizes are comparable if one multiples the entries in Table 2 
by k = 10. To the accuracy of the approximations used, the group test has the 
same power function and just a slightly larger expected sample size than the test 
which inspects the data continuously. 


REMARK 1.6. It is easy to devise other tests which behave about the same as
the modified repeated significance test discussed here. One possibility, suggested
independently by Miller (1970), Samuel-Cahn (1974), and O'Brien and Fleming
(1979), is to stop at $\min(N, m)$, where $N = \min\{n: |S_n| \ge B\}$, and to reject $H_0$:
$\mu = 0$ if $N \le m$. While the properties of these tests are similar to those of the
modified repeated significance tests defined above, they seem less convenient in
certain respects. For example, because the parameter $b$ of a (modified) repeated
significance test is less sensitive to the value of $m$ than is $B$, which is roughly
proportional to $m^{1/2}$, someone who knows the current values of $n$ and $S_n$ has
some idea whether $|S_n|/n^{1/2}$ is close to $b$ even if he does not know the maximum
sample size $m$, or what may be more important, even if $n$ is relatively small and
the experimenters have not yet settled on a value for $m$. See also Section 1.3.


A modified repeated significance test is designed to produce a fixed sample size
$m$ unless there is a substantial treatment effect as measured by the parameter of
primary interest, $\mu$. We assume that if $|\mu|$ is large, the preference for one
treatment is so strong that other considerations are essentially irrelevant.
However, there typically are other measures of treatment effect which one wants to
explore, especially if $\mu = 0$; but because of their secondary importance they do
not enter into the definition of the stopping rule. There are undoubtedly also
cases where if $\mu$ is close to 0, one would like to terminate the experiment as soon
as possible because of economic considerations.

One can easily obtain reasonable tests which provide for early termination
when $H_0$ appears to be true by splicing together one-sided tests. For example, we
consider initially a modified repeated significance test of $H_0$: $\mu = 0$ against $H_1$:
$\mu > 0$ defined by the stopping time

         $T_1 = \inf\{n: n \ge m_{01},\ S_n \ge b_1 n^{1/2}\}$

with rejection of $H_0$ if $T_1 \le m$ or $T_1 > m$ and $S_m \ge cm^{1/2}$. Now consider adding
a lower stopping boundary (cf. Figure 2)

(1.7)    $T_2 = \inf\{n: n \ge m_{02},\ S_n \le -b_2 n^{1/2} + \delta n\}$, $\delta > 0$,

and define a new test which stops sampling at $T_1 \wedge T_2 \wedge m$ with rejection of $H_0$
if $T_1 \le T_2 \wedge m$ or $T_1 \wedge T_2 > m$ and $S_m \ge cm^{1/2}$. (Here we are assuming that
$-b_2 m^{1/2} + \delta m < cm^{1/2}$.) Presumably $\delta$ is chosen to be a positive treatment
effect which it is important to detect. Since one hopes to accomplish something
different with $T_2$ than with $T_1$, there is no obvious reason that the lower
boundary should have the same shape as the upper boundary, or if it has, that $b_1$
and $b_2$ should have any particular relation. Nevertheless, for the convenience of
this theoretical discussion, we assume that $m_{01} = m_{02} = m_0$ and $b_1 = b_2 = b$,
say.


FIG. 2. [Figure: upper stopping boundary and lower stopping boundary $-b_2 n^{1/2} + \delta n$ for $S_n$.]
The power of this test is

(1.8)    $P\{T_1 \le T_2 \wedge m\} + P\{T_1 \wedge T_2 > m,\ S_m \ge cm^{1/2}\}$,

which is difficult to compute exactly, but which usually is easily approximated
by results developed to deal with (1.5). One approximation to (1.8) is

(1.9)    $P\{S_m \ge cm^{1/2}\} + P\{T_1 < m,\ S_m < cm^{1/2}\} - P\{T_2 < m,\ S_m \ge cm^{1/2}\}$.

It may be shown that the difference between (1.8) and (1.9) is

         $P\{T_1 < T_2 < m,\ S_m \ge cm^{1/2}\} - P\{T_2 < T_1 < m,\ S_m < cm^{1/2}\}$,


which involves sample paths which first cross one stopping boundary, then the 
other, and have partially crossed the continuation region again by time m. These 
probabilities are usually insignificantly small unless m is close to the point where 
the upper and lower boundaries meet, in which case m can probably be reduced 
without adversely affecting the overall properties of the test (cf. Anderson, 1960). 
For the somewhat simpler case of a truncated sequential probability ratio test, 
Siegmund (1985b, III.6) shows that the corresponding approximation is a good 
one. 

For a numerical example consider the test of Table 2, which has a significance
level of about 0.025 as a one-sided test against $H_1$: $\mu > 0$. If we now introduce a
lower stopping boundary (1.7) with $\delta = 1.759$ and decrease $c$ slightly to 2.02, the
approximation (1.9) indicates that the significance level of the new test is again
about 0.025 and the power at $\mu = 1.759$ is still 0.97. At $\mu = 0.994$ the power is
about 0.58, so introduction of the lower stopping boundary appears to lead to a
negligible loss of power. On the other hand, the expected sample size when $\mu = 0$
is roughly the same as the expected sample size in Table 2 for $\mu = 1.759$, or about
2.8. This is a considerable reduction from the expected sample size of a repeated
significance test, which is just slightly less than the maximum sample size,
$m = 5$.




In recent years various authors have attempted to define attained significance
levels, or $p$-values, and confidence sets relative to sequential tests. Both of these
concepts require that the possible outcomes be ordered so that one knows what it
means to say that one outcome is more extreme than another. For example,
suppose that we use the stopping rule (1.1) and the test terminates at $T = n \in
(m_0, m]$. It seems reasonable to say that a more extreme result would be a
sample outcome which terminates at this or a smaller value of $T$ and hence to
define the (two-sided) attained significance level of the observed result to be
$P_0\{T \le n\}$. By similar reasoning, one can define a confidence interval for $\mu$. For a
lower $(1 - \alpha)100\%$ confidence bound, if $T = n \in (m_0, m]$ and $S_n > 0$, we can
take for a bound that value $\mu_*$ which satisfies

         $P_{\mu_*}\{T \le n,\ S_T > 0\} = \alpha$.

The bound is defined similarly if $T = m_0$, $T > m$, etc.

For $b = 2.413$ as in Table 1, the attained significance of $T = 2$ according to
the preceding definition is about 0.027. Thus, in spite of the dramatic action of
stopping the test after 40% of its projected duration, the evidence against $H_0$ as
measured by the $p$ value is by no means dramatic. The situation is somewhat
different for a modified repeated significance test if $b$ is taken sufficiently large.
For $b = 2.71$ as in Table 2, $P_0\{T \le 2\} = 0.012$; and the attained significance
of any result which terminates the test before time $m = 5$ is bounded by
$P_0\{T \le 5\} = 0.023$. Of course, it would defeat the purpose of using a sequential
test if one insisted that the $p$ value be extremely small before stopping the
experiment.

It seems difficult to give a persuasive theoretical justification for the
definitions suggested here, and hence the principal argument in support of them is
that several authors have independently arrived at essentially the same
definitions. Berk and Brown (1978) discuss different alternatives. One is to order
sample outcomes according to the value of $\hat\mu = S_{T \wedge m}/(T \wedge m)$. Fairbanks and
Madsen (1982) suggest yet a third definition. If there were no excess over the
stopping boundary these three definitions would be equivalent. However, for a
group sequential test with a small number of fairly large groups, neglecting excess
over the boundary presumably leads to a loss of precision. See the example in
Section 1.3. Ordering outcomes by the values of $\hat\mu$ has the advantage of
generalizing directly to the case where additional data become available after the
experiment has nominally been terminated. Often these data will be a small part
of the total sample and could be neglected. See Samuel-Cahn and Wax (1985) for
an example to the contrary.


1.2. Sequential survival analysis. The discussion of the preceding section is 
extremely simplified, and to see how it provides considerable insight for more 
realistic models, we consider next the possibility of using sequential methods in 
clinical trials involving survival data, analyzed by the proportional hazards 
model (Cox, 1972). The notation is unavoidably complicated. 

Suppose that patients arrive (are born) at times $y_1, y_2, \ldots$. Associated with
the $i$th patient is a triple $(z_i, x_i, c_i)$, where $z_i$ is a covariate, $x_i$ is the length of
survival (age at death), and $c_i$ denotes the time of censoring after arrival. The
assumption of the proportional hazards model is that

         $P\{x_i \in [s, s + ds) \mid z_i, x_i \ge s\} = d\Lambda_i(s) = \exp(\beta z_i)\lambda(s)\,ds$,

for some unknown parameter $\beta$ and baseline hazard function $\lambda$. Also let
$R(t, s) = \{i: y_i \le t - s,\ x_i \wedge c_i \ge s\}$ denote those patients who are at risk at
time $t$ and whose age (measured from arrival) is at least $s$. Let

         $N_i(t, s) = I\{y_i + x_i \le t,\ x_i \le c_i,\ x_i \le s\}$

be the indicator that the $i$th patient arrived and died before time $t$, died at an
age $\le s$, and was not censored at the epoch of death. Cox (1972, 1975) suggested
that this model be analyzed by applying likelihood methods to the log partial
likelihood function

         $l(t, \beta) = \sum_i \int_{[0,t]} \{\beta z_i - \log \sum_{j \in R(t,s)} \exp(\beta z_j)\}\, N_i(t, ds)$.


In particular, consider the score process $\dot l(t, \beta) = \partial l(t, \beta)/\partial\beta$, or more
generally, the two parameter process

(1.10)   $\dot l(t, s, \beta) = \sum_i \int_{[0,s]} \{z_i - \mu_\beta(t, u)\}\, N_i(t, du)$

with

         $\mu_\beta(t, u) = \sum_{j \in R(t,u)} z_j \exp(\beta z_j) \Big/ \sum_{j \in R(t,u)} \exp(\beta z_j)$.

It is easy to see that $\dot l(t, \beta) = \dot l(t, t, \beta)$. The score process can be used directly to
test the hypothesis $H_0$: $\beta = \beta_0$, and its zeros yield partial maximum likelihood
estimators of $\beta$. The asymptotic distribution theory used for probability
calculations is based on the fact that under mild regularity conditions

         $\dot l(t, \beta)/\{-\ddot l(t, \beta)\}^{1/2}$

has asymptotically a standard normal distribution. (Here $\ddot l(t, \beta)$ is the second
derivative of the log partial likelihood with respect to $\beta$.) See Cox (1975) for an
informal treatment and Gill (1980) for a sophisticated discussion based on
martingale theory.
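
As an illustration of the quantities just defined, the following sketch evaluates the score and minus the second derivative of the log partial likelihood at a single analysis time. It is a simplified version under stated assumptions: a scalar covariate, no tied death times, and all patients already reduced to (follow-up time, death indicator, covariate) triples. The function name `cox_score_and_information` is hypothetical.

```python
import math


def cox_score_and_information(times, events, z, beta=0.0):
    """Score and observed information of the Cox partial likelihood.

    times[i]  : observed follow-up time (age at death or at censoring)
    events[i] : 1 if the death was observed, 0 if censored
    z[i]      : scalar covariate
    Assumes no tied death times and a single analysis time.
    """
    score = 0.0
    information = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue
        # Risk set: everyone still under observation at age t_i.
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * z[j]) for j in risk]
        total = sum(w)
        mean = sum(wj * z[j] for wj, j in zip(w, risk)) / total
        second = sum(wj * z[j] ** 2 for wj, j in zip(w, risk)) / total
        score += z[i] - mean               # increment of the score
        information += second - mean ** 2  # increment of minus the second derivative
    return score, information


if __name__ == "__main__":
    print(cox_score_and_information([1, 2, 3, 4], [1, 1, 1, 1], [0, 1, 0, 1]))
```

With a binary covariate and `beta = 0`, each death contributes p(1 - p) to the information, where p is the fraction of the risk set with covariate 1; this is at most 1/4 and equals 1/4 when the risk set is balanced.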

The appropriate generalization for purposes of sequential analysis is that
$\dot l(t, \beta)$, when plotted against $-\ddot l(t, \beta)$ as the "time" parameter, behaves like
standard Brownian motion. By virtue of the Taylor series expansion

         $\dot l(t, \beta_0) = \dot l(t, \beta) + (\beta - \beta_0)\{-\ddot l(t, \beta_0)\} + o(\beta - \beta_0)$,

we see that for $\beta$ close to $\beta_0$, the test statistic $\dot l(t, \beta_0)$ plotted against $\{-\ddot l(t, \beta_0)\}$
as time behaves like Brownian motion with drift $\beta - \beta_0$ when $\beta$ is the true value
of the parameter. (See Figure 3.)

Formulating precisely and proving the claims of the preceding paragraphs are
a substantial undertaking, which is not attempted here. Sellke and Siegmund
(1983) give a fairly complete discussion under the additional assumption that the
triples $(z_i, x_i, c_i)$ are independently and identically distributed. A still more
difficult argument is required if, as seems desirable, one treats the $z$'s and $c$'s as
ancillary and conditions on their values (Sellke, 1985).

FIG. 3. [Figure: $\dot l(t, \beta_0)$ plotted against $-\ddot l(t, \beta_0)$, with the curve $y = bx^{1/2}$ and the line $y = (\beta - \beta_0)x$.]

The reason that it is much more difficult to study the score function as a
process in $t$ than marginally for fixed $t$ is briefly the following. By rewriting
(1.10) as

         $\dot l(t, s, \beta) = \sum_i \int_{[0,s]} \{z_i - \mu_\beta(t, u)\}[N_i(t, du) - I\{i \in R(t, u)\}\, d\Lambda_i(u)]$,

one can easily show that (1.10) is a martingale in $s$ for each fixed $t$. Hence
martingale central limit theory is tailor-made to study the behavior of (1.10) as a
process in $s$ and in particular its marginal distribution for $s = t$. However, (1.10)
with $s = t$ is not in general a martingale in $t$ (although it is in the degenerate
case that all arrival times $y_i$ are the same). Sellke and Siegmund (1983) show
that $\dot l(t, t, \beta)$ can, however, be approximated by a martingale uniformly in $t$; and
they then apply martingale central limit theory to this approximating martingale.
Sellke (1985) observes and exploits the fact that for $t_1 < t_2 < t_3 < t_4$

         $\dot l(t_2, s, \beta) - \dot l(t_1, s, \beta)$ and $\dot l(t_4, s, \beta) - \dot l(t_3, s, \beta)$

are orthogonal martingales in $s$.

For convenience data monitoring committees customarily meet at roughly
equal intervals of calendar time (e.g., every six months). According to the central
limit theory discussed above, $t$ units of "experimental" time is proportional to an
increase of $t$ units in the observed Fisher information process regardless of how
fast or slowly this occurs in calendar time. Siegmund (1985b) describes
a Monte Carlo experiment, which among other things indicates that this
discrepancy may not be important, at least if the arrival and censoring
mechanisms are not too erratic.


1.3. Example. A sequential clinical trial which has recently been described in
considerable detail in the medical-statistical literature is the randomized trial of
propranolol conducted by the $\beta$-Blocker Heart Attack Trial Research Group (cf.
BHAT, 1982; DeMets, Hardy, Friedman, and Lan, 1984). Over a period of about
27 months 3837 victims of acute myocardial infarction were randomized to a
placebo group (1921) or a treatment group (1916). The principal endpoint was a
survival time, which was assumed to follow a proportional hazards model.

The data monitoring committee planned reviews of the results to date at
$t = 1, 1.5, 2, 2.5, 3, 3.5$, and 4 years. It was assumed that these would correspond
to seven reviews at approximately equally spaced increments of increase in the 
observed Fisher information. The stopping rule used in this experiment was 
defined by parallel straight lines as in Remark 1.6, but for illustrating the theory 
developed here, we shall consider a modified repeated significance test. (For a 
discussion of the various factors in addition to the stopping rule which went into 
the actual decision to terminate the experiment, see DeMets et al., 1984.) 

Let $t_n$ denote the time of the $n$th planned inspection, $n = 1, 2, \ldots, 7$, and
consider the stopping rule

(1.11)   $T = \inf\{t_n: n \ge m_0,\ |\dot l(t_n, 0)| \ge b[-\ddot l(t_n, 0)]^{1/2}\}$.

To test the hypothesis $H_0$: $\beta = 0$ of no treatment effect, stop sampling at
$\min(T, t_7)$ and reject $H_0$ if either $T \le t_7$ or $T > t_7$ and $|\dot l(t_7, 0)| \ge c[-\ddot l(t_7, 0)]^{1/2}$.
The normal approximation described above indicates that $m_0 = 2$, $m = 7$, $b =
2.65$, and $c = 2.05$ yield a 0.05 level test having a power function very close to
that of the sequential design used in BHAT (1982). The power function and
approximate expected sample size for this test in the approximating normal
model are given in Table 4. For comparison the test actually used is also
included.

TABLE 4
Approximating normal model

          Modified repeated              Test defined in
          significance test^a            Remark 1.6^b
                   Expected                       Expected
$\mu$     Power    sample size           Power    sample size
1.5       0.97     3.59                  0.97     4.09
1.25      0.90     4.47                  0.90     4.95
0.75      0.50     *                     0.50     *
0.00      0.05     *                     0.05     *

^a $b = 2.65$, $c = 2.05$, $m_0 = 2$, $m = 7$.
^b $B = 5.48$, $m = 7$.

To relate the power as a function of the normal mean $\mu$ to the parameter $\beta$, it
is necessary to make some assumptions about the rate of increase of $-\ddot l(t, 0)$. For
the simple model we have discussed, if the censoring mechanism does not depend
on the covariate, it is easy to see that for $\beta$ close to 0 each death yields on the
average about 1/4 unit of information. In the 3.5 years before this experiment was
terminated there were 318 deaths for an accumulation of approximately 79.5
units of information, or an average of 13.3 units per inspection period. This
means that a value of $\mu$ in Table 4 corresponds roughly to a value of $\beta =
\mu/(13.3)^{1/2}$. In particular the row for $\mu = 1.25$ in Table 4 corresponds to $\beta$ about
equal to 0.34. (The discussion of sample size selection in BHAT, 1981, shows the
expectation before the experiment began of a somewhat more rapid rate of
accumulation of information, hence greater power, than actually occurred.)
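
The arithmetic behind this conversion can be written out as a short sketch. The variable and function names are illustrative; the figure of 1/4 unit of information per death and the count of six inspection periods (reviews at 1, 1.5, ..., 3.5 years) are taken from the surrounding text.

```python
import math

deaths = 318            # deaths observed in the first 3.5 years
info_per_death = 0.25   # approximation valid for beta close to 0
inspections = 6         # reviews at 1, 1.5, ..., 3.5 years

total_info = deaths * info_per_death          # about 79.5 units
info_per_period = total_info / inspections    # about 13.3 units per review


def beta_from_mu(mu, info_per_period):
    """Convert a normal-model mean per inspection period into a value of beta."""
    return mu / math.sqrt(info_per_period)


if __name__ == "__main__":
    print(total_info, info_per_period, beta_from_mu(1.25, info_per_period))
```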

Similarly, the expected sample size in Table 4 multiplied by an average rate of
accumulation of information gives the expected information until termination of
the experiment. This in itself may not be as meaningful as, for example, the
expected number of deaths or the expected real time of the experiment. If we use
the approximation that information equals 1/4 the number of deaths, then
expected information is directly proportional to a more meaningful quantity.
Without much stronger modeling assumptions, involving the arrival rate and the
baseline hazard function, there is no relation between expected information and
expected real time for the experiment. Qualitatively, information accumulates
more slowly early in the experiment, so a reduction in expected information of
50% compared to, say, a fixed sample test, invariably means a smaller reduction
in the expected time of the experiment.

The observed values of $\dot l/(-\ddot l)^{1/2}$ at 1, 1.5, ..., 3.5 years were respectively 1.68,
2.24, 2.37, 2.30, 2.34, and 2.82 (DeMets et al., 1984). For the test actually used
and also for the repeated significance test suggested above, these data lead to
termination of the experiment at $t = 3.5$ years, or six months before the final
planned inspection. (A more detailed analysis using a number of covariates gave
the value 3.05 for the corresponding normalized statistic, which is reasonably
consistent with the 2.82 of the simplest possible model. See BHAT, 1982.) The
values of $\dot l(t, 0)$ and $-\ddot l(t, 0)$ are not given separately, so it is not possible to plot $\dot l$
against $-\ddot l$ as in Figure 3. This is unfortunate because such a plot would allow
one to check whether inspections indeed occur at approximately equal
increments of increase of information; and much more importantly, since the plot
should be approximately a straight line with slope $\beta$, it would give a visual
estimate of $\beta$ and a visual goodness of fit test of the proportional hazards model.
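
The termination time quoted above can be reproduced by applying the two stopping rules to the six reported statistics. This is only a sketch: the function names are hypothetical, the second rule idealizes information as growing by one unit per review so that $S_n = z_n n^{1/2}$, and the constant $B = 5.48$ is our reading of the Table 4 footnote and should be treated as an assumption.

```python
import math

# Normalized statistics reported at the six reviews (DeMets et al., 1984).
observed = [1.68, 2.24, 2.37, 2.30, 2.34, 2.82]


def first_stop_modified_rst(z_values, b, m0):
    """First review n >= m0 with |z_n| >= b, or None if the rule never fires."""
    for n, z in enumerate(z_values, start=1):
        if n >= m0 and abs(z) >= b:
            return n
    return None


def first_stop_constant_boundary(z_values, big_b):
    """First review n with |S_n| >= B, taking S_n = z_n * sqrt(n).

    Idealization: information increases by one unit per review.
    """
    for n, z in enumerate(z_values, start=1):
        if abs(z) * math.sqrt(n) >= big_b:
            return n
    return None


if __name__ == "__main__":
    print(first_stop_modified_rst(observed, b=2.65, m0=2))
    print(first_stop_constant_boundary(observed, big_b=5.48))
```

Both rules first fire at the sixth review, which corresponds to 3.5 years.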

According to the definitions given in Section 1.1, the two-sided $p$ value of
$T = 6$ in the approximating normal model is $P_0\{T \le 6\} = 0.023$, and an 80%
confidence interval for $\mu$ is (0.44, 1.7). (The upper limit is given incorrectly in
Siegmund, 1985b, p. 134.) If one were to take excess over the boundary into
account these figures would change to 0.022 and (0.47, 1.6). The approximate
relation between $\mu$ and $\beta$ described above allows one to convert this confidence
interval for $\mu$ to an approximate confidence interval for $\beta$.


2. Boundary crossing probabilities. 


2.1. Introduction and asymptotic normalization. In this section we consider 
the mathematical problem of calculating approximately probabilities like (1.5). 


Let $x_1, x_2, \ldots$ be independent and identically distributed, and set $S_n =
x_1 + \cdots + x_n$. Let $c(n)$, $n = 1, 2, \ldots$ be constants and $m_0 < m$ positive integers.
Define the stopping time

         $T = \inf\{n: n \ge m_0,\ S_n \ge c(n)\}$,

and consider the problem of evaluating

(2.1)    $P\{T \le m\}$

or

(2.2)    $P\{T < m \mid S_m = \xi\}$.

Since $P\{T \le m\} = P\{S_m \ge c(m)\} + \int_{(-\infty, c(m))} P\{T < m \mid S_m = \xi\} P\{S_m \in d\xi\}$, and since
the distribution of $S_m$ is comparatively easy to evaluate, at least approximately,
a good approximation to (2.2) usually yields a good approximation to (2.1).
Similarly, evaluation of (2.2) is frequently the principal ingredient in calculating
(1.5). Hence our focus in what follows is on developing approximations to (2.2).


REMARK. In some contexts conditional probabilities of the form (2.2) are of
interest in their own right. Some examples from fixed sample statistics appear in
Part 3. The methods described below are also suitable for approximating

         $P\{T < m \mid S_N = \xi\} = E[P\{T < m \mid S_m\} \mid S_N = \xi]$, $m < N$,

which arises in problems of sequential sampling inspection for fraction defective
in a lot of size $N$, when the maximum sample size is a nonnegligible proportion
of $N$.


Since (2.2) can only rarely be evaluated exactly, it is convenient to imbed our 
problem in a sequence of problems and seek an asymptotic approximation. The 
actual calculations are preceded by some remarks about the two most obvious 
asymptotic formulations. 

In problems scaled for large deviations, we consider the asymptotic evaluation
as $m \to \infty$ of probabilities of the form

         $p(m) = P\{S_n \ge mc(n/m)$ for some $m_0 \le n < m \mid S_m = \xi\}$, $\xi = m\xi_0$.

Since the boundary $mc(n/m)$ is $O(m^{1/2})$ standard deviations away from the
(conditional) mean path of $S_n$, these probabilities typically converge to zero, and
a reasonable approximation would be of the form $p(m) \sim q(m)$ for some easily
evaluated analytic expression $q(m)$.

An alternative, the ordinary deviation or diffusion scaling, suggests
consideration of

         $p'(m) = P\{S_n \ge m^{1/2}c(n/m)$ for some $m_0 \le n < m \mid S_m = \xi\}$, $\xi = m^{1/2}\xi_0$.

Now the mean path of $S_n$ is $O(1)$ standard deviations from the boundary
$m^{1/2}c(n/m)$, so typically if $m_0/m \to t_0$

         $p'(m) \to p = P\{W(t) \ge c(t)$ for some $t_0 \le t < 1 \mid W(1) = \xi_0\}$,

where $W(t)$, $0 \le t < \infty$, is a Brownian motion process. The approximation of
$p'(m)$ by $p$ is often not particularly good, but it can be improved by finding an
expansion of the form

         $p'(m) = p + p_1 m^{-1/2} + o(m^{-1/2})$,

which has been called a corrected diffusion approximation (cf. Siegmund, 1985a,
and references cited there).

Typically, large deviation approximations are more easily obtained than 
corrected diffusion approximations. This is especially true for nonlinear 
boundaries, c(n). See Hogan (1984) for the first corrected diffusion approxima- 
tions in a nonlinear case. Occasionally it is possible to write a single approxima- 
tion which is applicable to both cases. When this is so, that approximation is 
usually a very good one. Except for a few remarks, only large deviation scaling is 
considered in what follows. 

Numerous methods have been invented for approximating boundary crossing 
probabilities (e.g., Borovkov, 1962; Woodroofe, 1976b; Lai and Siegmund, 1977; 
Daniels, 1974; Jennen and Lerche, 1981; Durbin, 1985). The method described 
below has the virtues that it is essentially the same in both discrete and 
continuous time, it is fairly general, and it yields exact results in most of the 
simple situations where exact results can be obtained. Our starting point is a 
derivation of the standard reflection principle for Brownian motion. The argu- 
ment is then incrementally modified to deal with problems in discrete time and 
problems involving nonlinear boundaries. Woodroofe (1982) contains an exposi- 
tion of alternative methods supported by complete proofs. 


2.2. Reflection principle for Brownian motion. Let $W(t)$, $0 \le t < \infty$, be
Brownian motion with drift $\mu$ and unit scale parameter, and let $\mathcal{F}(t)$ denote the
$\sigma$-field of events defined by $W(s)$, $0 \le s \le t$. It will be convenient to use the
notation

         $P_\xi^{(m)}(A) = P\{A \mid W(m) = \xi\}$, $A \in \mathcal{F}(m)$.

By the sufficiency of $W(m)$, this conditional probability does not depend on $\mu$.
For all $\xi_1 \neq \xi_2$ and $t < m$, the probabilities $P_{\xi_1}^{(m)}$ and $P_{\xi_2}^{(m)}$ when restricted to
$\mathcal{F}(t)$ are mutually absolutely continuous; a straightforward calculation shows
that the likelihood ratio of $W(s)$, $s \le t$, under $P_{\xi_1}^{(m)}$ relative to $P_{\xi_2}^{(m)}$ is

(2.3)    $L^{(m)}(t, W(t); \xi_1, \xi_2) = \exp[\{(\xi_1 - \xi_2)W(t) - \tfrac{1}{2}(\xi_1^2 - \xi_2^2)t/m\}/(m - t)]$.

The following is a version of Wald's likelihood ratio identity, which can be
proved by standard martingale arguments.

PROPOSITION 2.4. For any $\xi_1 \neq \xi_2$, $m > 0$, stopping time $T$ and event $A \in
\mathcal{F}(T)$

         $P_{\xi_1}^{(m)}(A \cap \{T < m\}) = E_{\xi_2}^{(m)}[L^{(m)}(T, W(T); \xi_1, \xi_2);\ A \cap \{T < m\}]$,

where $L^{(m)}$ is given by (2.3).


Let $b > 0$, $-\infty < \eta < \infty$, and define

         $\tau = \inf\{t: W(t) \ge b + \eta t\}$.

Let $\xi_1 = \xi < b + \eta m$, and let $\xi_2 = 2(b + \eta m) - \xi$ be $\xi$ reflected about $b + \eta m$.
(See Figure 4.) Since $W(\tau) = b + \eta\tau$ on $\{\tau < m\}$ and $P_{\xi_2}^{(m)}\{\tau < m\} = 1$, from
Proposition 2.4 and simple algebra one obtains the well-known result

(2.5)    $P_\xi^{(m)}\{\tau < m\} = \exp[-2b(b + \eta m - \xi)/m]$.


Siegmund and Yuh (1982) show how a slightly more sophisticated version of 
this argument yields Anderson’s (1960) results. 
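
The "simple algebra" behind (2.5) amounts to the observation that the likelihood ratio (2.3), evaluated on the boundary $W(\tau) = b + \eta\tau$ with $\xi_2$ the reflection of $\xi$, does not depend on $\tau$. The following sketch checks this numerically; the function names and the particular parameter values are illustrative assumptions.

```python
import math


def likelihood_ratio(t, w, xi1, xi2, m):
    """Likelihood ratio (2.3): bridge ending at xi1 relative to bridge ending at xi2."""
    return math.exp(((xi1 - xi2) * w - 0.5 * (xi1 ** 2 - xi2 ** 2) * t / m) / (m - t))


def crossing_probability(b, eta, xi, m):
    """Right-hand side of (2.5); requires xi < b + eta * m."""
    return math.exp(-2.0 * b * (b + eta * m - xi) / m)


if __name__ == "__main__":
    b, eta, xi, m = 2.0, 0.3, 1.0, 5.0
    xi_reflected = 2.0 * (b + eta * m) - xi
    for t in (0.5, 1.0, 2.5, 4.9):
        # On the boundary the ratio is the same for every crossing time t.
        print(t, likelihood_ratio(t, b + eta * t, xi, xi_reflected, m),
              crossing_probability(b, eta, xi, m))
```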


2.3. Correction for discrete time. Consider now the same problem in discrete 
time, so 


         $\tau = \inf\{n: S_n \ge b + \eta n\}$,

where $S_n = x_1 + \cdots + x_n$, and under $P_\mu$ the $x$'s are independent normally
distributed random variables with mean $\mu$ and variance 1. Now the preceding
argument yields

(2.6)    $P_\xi^{(m)}\{\tau < m\}\exp[2b(b + \eta m - \xi)/m]$
             $= E_{\xi_2}^{(m)}[\exp\{-2(b + \eta m - \xi)[S_\tau - b - \eta\tau]/(m - \tau)\};\ \tau < m]$,

where $\xi_2 = 2(b + \eta m) - \xi$.

To analyze the right-hand side of (2.6) asymptotically, suppose that $b = \zeta m$
and $\xi = m\xi_0$ for some fixed $\zeta > 0$ and $\xi_0 < \zeta + \eta$. Since the $P_{\xi_2}^{(m)}$ deviations of $S_n$
from its expectation, $[2(\zeta + \eta) - \xi_0]n$, are of order $n^{1/2}$, a law of large numbers
argument shows that with probability approaching 1, $S_n$ crosses the line $\zeta m + \eta n$
near where its line of drift does, so for any $\varepsilon > 0$

(2.7)    $\lim_{m \to \infty} P_{\xi_2}^{(m)}\{|m^{-1}\tau - \zeta(2\zeta + \eta - \xi_0)^{-1}| > \varepsilon\} = 0$.

(See Figure 4.)

FIG. 4. [Figure: the points $(m, 2(b + \eta m) - \xi)$, $(m, b + \eta m)$, $(m, \xi)$, $((2\zeta + \eta - \xi_0)^{-1}\zeta m, 0)$ and $(m, 0)$ are marked.]

It follows that the right-hand side of (2.6) has the same
asymptotic behavior as

         $E_{\xi_2}^{(m)}[\exp\{-2(2\zeta + \eta - \xi_0)(S_\tau - \zeta m - \eta\tau)\};\ \tau < m]$.


If this expectation were with respect to the unconditional probability with the
same drift, $P_{2(\zeta+\eta)-\xi_0}$, one could apply the renewal theorem in the manner which
Feller (1971, XII) uses to derive Cramér's estimate for the probability of ruin,
and hence evaluate this expectation in the limit as $m \to \infty$. Specifically, observe that for a
random walk $\tilde S_n$, $n = 1, 2, \ldots$, with nonnegative drift $\tilde\mu = E\tilde S_1$ and for

         $\tilde\tau = \inf\{n: \tilde S_n > a\}$,

$\tilde S_{\tilde\tau} - a$ can be regarded as the residual lifetime in a renewal process defined by
$\tilde S_{\tilde\tau_+}$, where $\tilde\tau_+ = \inf\{n: \tilde S_n > 0\}$. Hence if $\tilde S_1$ is nonarithmetic the renewal
theorem implies as $a \to \infty$,

(2.8)    $P\{\tilde S_{\tilde\tau} - a \le x\} \to [E(\tilde S_{\tilde\tau_+})]^{-1}\int_{(0,x)} P\{\tilde S_{\tilde\tau_+} > y\}\,dy$.

See Feller (1971, XI and XII). For a discussion which is oriented towards the
present application, see Siegmund (1985b, VIII).

During the relatively short time interval in which according to (2.7) $\tau$ falls
with probability close to 1, the increments to the conditional $P_{\xi_2}^{(m)}$ process $S_n$
and the unconditional $P_{2(\zeta+\eta)-\xi_0}$ process both behave essentially the same, so the
$P_{\xi_2}^{(m)}$ and $P_{2(\zeta+\eta)-\xi_0}$ limiting distributions of $S_\tau - \zeta m - \eta\tau$ are the same, and are
given by (2.8) with $\tilde S_n = S_n - \eta n$.

One simple way to make this argument precise is to obtain a slightly different
version of (2.6) by using Wald's likelihood ratio identity to differentiate $P_\xi^{(m)}$
with respect to $P_{2(\zeta+\eta)-\xi_0}$ instead of $P_{\xi_2}^{(m)}$. Let $\mu_0 = \xi_2/m = 2(\zeta + \eta) - \xi_0$. An
easy calculation shows that the likelihood ratio of $x_1, \ldots, x_n$ under $P_{\xi_2}^{(m)}$ relative
to $P_{\mu_0}$ is

         $\exp\{-\tfrac{1}{2}(S_n - n\mu_0)^2/(m - n)\}(1 - n/m)^{-1/2}$,

so the right-hand side of (2.6) equals

(2.9)    $E_{\mu_0}[\exp\{-2(\zeta + \eta - \xi_0)(S_\tau - b - \eta\tau)/(1 - \tau/m)$
             $- \tfrac{1}{2}(S_\tau - \mu_0\tau)^2/(m - \tau)\}(1 - \tau/m)^{-1/2};\ \tau < m]$.

The asymptotic marginal distributions of the random variables appearing in
(2.9) are easily determined. Under $P_{\mu_0}$, $\tau/m$ converges in probability to the same
limit as in (2.7); the renewal theorem applies as in (2.8) with $\tilde S_{\tilde\tau} - a = S_\tau - \eta\tau - b$;
and by an easy application of Anscombe's theorem, $(S_\tau - \mu_0\tau)/\tau^{1/2}$ is
asymptotically standard normal. Also by Lemma 2.16 below, $S_\tau - \eta\tau - b$ and
$(S_\tau - \mu_0\tau)/\tau^{1/2}$ are asymptotically independent. Thus we have all the ingredients to
evaluate (2.9). From (2.8) and some calculation one sees that the limit of (2.9) is
$\nu[2(2\zeta + \eta - \xi_0)]$, where for $\mu > 0$ and $\tau_+ = \inf\{n: S_n > 0\}$

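(2.10)   $\nu(\mu) = [1 - E_{\mu/2}\exp(-\mu S_{\tau_+})]/\{\mu E_{\mu/2}(S_{\tau_+})\}$.

Hence by (2.6)

(2.11)   $P_\xi^{(m)}\{\tau < m\} \sim \nu[2(2\zeta + \eta - \xi_0)]\exp[-2m\zeta(\zeta + \eta - \xi_0)]$

($b = m\zeta$, $\xi = m\xi_0$, $\zeta > 0$, $\xi_0 < \zeta + \eta$). Random walk theory (cf. Feller, 1971,
XVIII or Siegmund, 1985b, VIII) permits one to obtain a numerically calculable
expression for (2.10), to wit

(2.12)   $\nu(\mu) = 2\mu^{-2}\exp[-2\sum_{n=1}^{\infty} n^{-1}\Phi(-\tfrac{1}{2}\mu n^{1/2})]$,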


where $\Phi$ is the standard normal distribution function. For many purposes it
suffices to use the approximation

(2.13)   $\nu(\mu) = \exp(-\rho\mu) + o(\mu^2)$, $\mu \to 0$,

where

(2.14)   $\rho = E_0 S_{\tau_+}^2/2E_0(S_{\tau_+}) = -\pi^{-1}\int_0^\infty \lambda^{-2}\log[2\lambda^{-2}\{1 - \exp(-\lambda^2/2)\}]\,d\lambda$
             $\cong 0.583$.

Partial justification for (2.13) comes from a Taylor series expansion of (2.10) to
obtain

(2.15)   $\nu(\mu) = 1 - \mu E_{\mu/2}(S_{\tau_+}^2)/2E_{\mu/2}(S_{\tau_+}) + \cdots$.

This is easily turned into a proof of (2.13) with an error $o(\mu)$. That the error is
actually $o(\mu^2)$ and that $\rho$ has the value given in (2.14) are more difficult to prove.
See Siegmund (1985b, X) for details.
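
The numerical value of $\rho$ can be checked by direct quadrature of the integral in (2.14). The sketch below is one possible way to do this and is not taken from the paper: it uses a plain midpoint rule, splits the range at $\lambda = 1$, and maps $[1, \infty)$ onto $(0, 1]$ by the substitution $\lambda = 1/u$. The function names and the number of cells are illustrative choices.

```python
import math


def rho_integrand(lam):
    """Integrand of (2.14): lam**-2 * log[2 * lam**-2 * (1 - exp(-lam**2 / 2))]."""
    inner = 2.0 * (-math.expm1(-0.5 * lam * lam)) / (lam * lam)
    return math.log(inner) / (lam * lam)


def rho(n_cells=100_000):
    """Midpoint-rule evaluation of (2.14) on (0, 1] and, via lam = 1/u, on [1, inf)."""
    h = 1.0 / n_cells
    total = 0.0
    for k in range(n_cells):
        u = (k + 0.5) * h
        total += rho_integrand(u)                   # piece over 0 < lam <= 1
        total += rho_integrand(1.0 / u) / (u * u)   # piece over lam >= 1
    return -total * h / math.pi


if __name__ == "__main__":
    print(rho())
```

The transformed integrand has only a logarithmic singularity at $u = 0$, so the midpoint rule converges, although slowly; a finer grid or an adaptive rule would be needed for more digits.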

To complete the proof of (2.11), we must justify the asymptotic independence
of $(S_\tau - \mu_0\tau)/\tau^{1/2}$ and $S_\tau - \eta\tau - b$ used in evaluating the limit of (2.9). The first
person to have noticed this relation appears to have been Stam (1968).


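LEMMA 2.16. Let $\tilde S_n$, $n = 1, 2, \ldots$ be a nonarithmetic random walk with drift
$E(\tilde S_1) = \tilde\mu > 0$ and finite variance $\tilde\sigma^2 = \mathrm{var}(\tilde S_1)$. Let $\tilde\tau = \tilde\tau(a) = \inf\{n: \tilde S_n > a\}$.
As $a \to \infty$, for all $x > 0$, $-\infty < y < \infty$

         $P\{\tilde S_{\tilde\tau} - a \le x,\ (\tilde\tau - a\tilde\mu^{-1})/(a\tilde\sigma^2\tilde\mu^{-3})^{1/2} \le y\} \to H(x)\Phi(y)$

and

         $P\{\tilde S_{\tilde\tau} - a \le x,\ (\tilde S_{\tilde\tau} - \tilde\mu\tilde\tau)/\tilde\tau^{1/2} \le y\} \to H(x)\Phi(y)$,

where $H$ is the distribution function given in (2.8).


REMARK. A similar result holds for arithmetic random walks, but the
distribution $H$ is slightly different. A result corresponding to the first relation in
Lemma 2.16 holds if $\tilde\mu = 0$, but in this case the appropriately normalized $\tilde\tau$ is
$\tilde\tau/a^2$, and $\Phi$ must be replaced by $2\Phi(y^{-1/2}) - 1$.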




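PROOF OF LEMMA 2.16. Since by (2.8)

         $a^{-1/2}(\tilde S_{\tilde\tau} - a) \to 0$ in probability,

the second asymptotic relation follows from the first. To prove the first relation,
let $n = n(a, y) = a\tilde\mu^{-1} + y(a\tilde\sigma^2\tilde\mu^{-3})^{1/2}$. Then

         $P\{\tilde S_{\tilde\tau} - a \le x,\ \tilde\tau > n\} = E[P\{\tilde S_{\tilde\tau} - a \le x \mid \tilde\tau > n, \tilde S_n\};\ \tilde\tau > n]$.

Suppose $a_1 < a$, $a - a_1 \to \infty$, but $a - a_1 = o(a^{1/2})$. Then by the central limit
theorem

         $E[P\{\tilde S_{\tilde\tau} - a \le x \mid \tilde\tau > n, \tilde S_n\};\ \tilde\tau > n,\ a_1 < \tilde S_n < a] \le P\{a_1 < \tilde S_n < a\} \to 0$.

Also, uniformly on $\{\tilde\tau > n, \tilde S_n \le a_1\}$, by (2.8)

         $P\{\tilde S_{\tilde\tau} - a \le x \mid \tilde\tau > n, \tilde S_n\} = P\{\tilde S_{\tilde\tau(a - \tilde S_n)} - (a - \tilde S_n) \le x\} \to H(x)$.

Hence

         $P\{\tilde S_{\tilde\tau} - a \le x,\ \tilde\tau > n\} = H(x)P\{\tilde\tau > n,\ \tilde S_n \le a_1\} + o(1)$
             $= H(x)P\{\tilde\tau > n\} + o(1)$.

The lemma follows from the well-known and easily proved asymptotic normality
of $\tilde\tau$ with the indicated scaling.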


Using (2.13), one can rewrite (2.11) in the form

(2.17)   $P_\xi^{(m)}\{\tau < m\} \cong \exp\{-2(b + \rho)(b + \rho + \eta m - \xi)/m\}$.

This last approximation is particularly interesting because it is of the form (2.5)
with $b$ replaced by $b + \rho$. Moreover, it follows from (2.8) and (2.13)-(2.15) that $\rho$
is approximately the expected excess of the discrete random walk over the
boundary, so (2.17) has the interpretation that to correct for discrete time one
can use the Brownian motion result (2.5) with boundaries displaced by the
average amount the discrete time process jumps over the boundary (cf.
Siegmund, 1985a, for other results having a similar interpretation).

The approximation (2.17) is also valid as a corrected diffusion approximation,
i.e., if $b = \zeta m^{1/2}$, $\eta = \eta_0 m^{-1/2}$, and $\xi = \xi_0 m^{1/2}$, the difference between the two
sides of (2.17) is $o(m^{-1/2})$. This result can be proved along the lines of the
argument sketched above; but the details are more difficult because the $P_{\xi_2}^{(m)}$
distribution of $m^{-1}\tau$ does not become degenerate as $m \to \infty$. See Siegmund
(1985a).

The accuracy of (2.17) is quite good. For $m = 3$, $b = 1.564$, and $\eta = 0$ Worsley
(1983) has numerically calculated $P_0^{(m)}\{\tau < m\}$ to be 0.05. The approximation
(2.17) yields 0.0463. For the corresponding comparison when $m = 5$, $b = 2.165$
($m = 10$, $b = 3.292$), (2.17) gives 0.0488 (0.0496). It is perhaps worth observing
that an uncorrected diffusion approximation is very poor for small $m$, ranging
from 0.1 to 0.2 for these examples.
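
The comparison above can be reproduced with a few lines of code. The following is a minimal sketch, not the author's program: it assumes the numerical value $\rho \approx 0.583$ for the overshoot constant (the constant is defined in (2.13)-(2.15), outside this passage), takes $\xi = 0$ and $\eta = 0$ as in Worsley's examples, and uses the hypothetical function name `crossing_prob`. Setting `rho=0` gives the uncorrected Brownian formula (2.5).

```python
import math

RHO = 0.583  # assumed numerical value of the overshoot constant rho


def crossing_prob(b, m, eta=0.0, xi=0.0, rho=RHO):
    """Approximation (2.17); rho=0.0 reduces it to the Brownian formula (2.5)."""
    return math.exp(-2.0 * (b + rho) * (b + rho + eta * m - xi) / m)


for m, b in [(3, 1.564), (5, 2.165), (10, 3.292)]:
    corrected = crossing_prob(b, m)
    uncorrected = crossing_prob(b, m, rho=0.0)
    print(f"m={m:2d}  b={b:.3f}  (2.17)={corrected:.4f}  (2.5)={uncorrected:.4f}")
```

The corrected values stay close to the exact 0.05, while the uncorrected ones fall between 0.1 and 0.2, in line with the remark in the text.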

The asymptotic relation (2.11) can be generalized to a large class of random
walks whose distribution can be imbedded in an exponential family (Siegmund,
1982). The analogue of $\nu$ given in (2.10) can be computed numerically using
results of Woodroofe (1979) or by an approximation along the lines of (2.13).
With some technical improvements the method also works for a general class of
nonlinear boundaries. The key is (2.7), which suggests that if the boundary is to
be crossed at all, it will be crossed close to some distinguished point. This further
suggests that one try to approximate the boundary by its tangent at this
distinguished point, which can be determined as the point through which the
$P_\xi^{(m)}$ line of drift passes when $\xi$ is appropriately chosen for the linear problem
of the tangent line. Siegmund (1982) discusses the example of repeated significance
tests in detail.


2.4. Repeated significance tests. For repeated significance tests in exponential
families a slightly modified method requires considerably less algebraic
detail. We continue to consider the case of normally distributed observations,
and let $T$ be defined by (1.1).


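THEOREM 2.18. Suppose $b \to \infty$, $m \to \infty$ and $m_0 \to \infty$ in such a way that
for fixed $0 < \mu_1 < \mu_0 < \infty$, $bm^{-1/2} = \mu_1$, $bm_0^{-1/2} = \mu_0$. Let $0 < |\xi_0| < \mu_1$ and
$\xi = m\xi_0$. Then
$$P_\xi^{(m)}\{T < m\} \le (\mu_0/\mu_1)\exp[-\tfrac12 m(\mu_1^2 - \xi_0^2)];$$
and for $\mu_1^2/\mu_0 < |\xi_0| < \mu_1$

$$P_\xi^{(m)}\{T < m\} \sim \nu(\mu_1^2/|\xi_0|)\,\mu_1|\xi_0|^{-1}\exp[-\tfrac12 m(\mu_1^2 - \xi_0^2)],$$

where $\nu$ is given by (2.12).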


REMARK. The restriction that $|\xi_0|$ be a fixed number in $(\mu_1^2/\mu_0, \mu_1)$ is for
ease of exposition. The cases $\xi_0 = \mu_1 - y/m$ and $\xi_0 = (\mu_1^2/\mu_0) + y/m^{1/2}$ arise
naturally (Siegmund, 1985b, IX.3 and XI.1), and fixed $|\xi_0| \in (0, \mu_1^2/\mu_0)$ can also
be considered. A similar remark applies to Theorem 3.11 below.
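
As an illustration of Theorem 2.18, the bound and the asymptotic approximation can be evaluated side by side. This is only a sketch under stated assumptions: $\nu(x)$ is replaced by $\exp(-0.583x)$ (an assumed form of the approximation (2.13), which lies outside this passage), and the function names are hypothetical.

```python
import math

RHO = 0.583  # assumed constant in the approximation nu(x) ~ exp(-RHO * x)


def nu(x):
    """Assumed stand-in for the function nu of (2.12)."""
    return math.exp(-RHO * x)


def theorem_2_18(m, mu1, mu0, xi0):
    """Return (upper bound, asymptotic approximation) for P_xi^(m){T < m}, xi = m*xi0."""
    if not (mu1 ** 2 / mu0 < abs(xi0) < mu1):
        raise ValueError("the asymptotic relation needs mu1^2/mu0 < |xi0| < mu1")
    expo = math.exp(-0.5 * m * (mu1 ** 2 - xi0 ** 2))
    bound = (mu0 / mu1) * expo
    approx = nu(mu1 ** 2 / abs(xi0)) * (mu1 / abs(xi0)) * expo
    return bound, approx


bound, approx = theorem_2_18(m=50, mu1=0.4, mu0=1.0, xi0=0.3)
print(f"bound={bound:.4f}  approximation={approx:.4f}")
```

Since $|\xi_0| > \mu_1^2/\mu_0$ implies $\mu_1/|\xi_0| < \mu_0/\mu_1$ and $\nu \le 1$, the approximation always lies below the bound.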


COROLLARY 2.19. Suppose that the asymptotic scaling of Theorem 2.18 holds
and also $cm^{-1/2} = \gamma \in (\mu_1^2/\mu_0, \mu_1)$. Then as $m \to \infty$

$$P_\mu\{T < m, |S_m| < cm^{1/2}\} \sim 2b\varphi(b)\exp(-m\mu^2/2)\int_{\mu_1^2\gamma^{-1}}^{\mu_0} x^{-1}\cosh(b^2\mu/x)\,\nu(x)\,dx. \tag{2.20}$$

For $\mu \ne 0$ the right-hand side of (2.20) is

$$\sim \frac{\varphi[m^{1/2}(\mu_1 - |\mu|)]}{m^{1/2}|\mu|}\,\mu_1\gamma^{-1}\nu(\mu_1^2\gamma^{-1})\exp[-m|\mu|(\mu_1 - \gamma)]. \tag{2.21}$$

Here $\varphi$ denotes the standard normal density function and $\nu$ is given by (2.12).
The relation (2.20) also holds when $\mu = 0$ and $c = b$ ($\gamma = \mu_1$), or if $m_0 = o(m)$
as $b \to \infty$.


REMARKS. Corollary 2.19 suggests that one approximate (1.5) by using (2.20)
or (2.21) as an approximation for the second term on the right-hand side. These
are in fact the approximations used to compute the entries of Tables 1-4.
Strictly speaking (2.20) is not a true asymptotic relation when $\mu \ne 0$ and $c = b$,
but there is a heuristic argument for using it in this case as well. See Siegmund
(1985b, IX.3). For $\mu = 0$, (2.20) presents a simple numerical problem and is easily
tabled (Siegmund, 1985b, IV.3). For $\mu \ne 0$ (2.21) is easily evaluated with the aid
of (2.13)-(2.14). For $\mu$ close to 0 (2.20) provides a more accurate approximation,
which can be useful when determining confidence bounds. A small disadvantage
to (2.20) when $\mu \ne 0$ is that a rather extensive (two-dimensional) table must be
constructed if one does not want to be dependent on access to computing
facilities.


Corollary 2.19 follows easily from the theorem, and some simple estimates 
which are omitted here (cf. Siegmund, 1985b, IX.3). A proof of Theorem 2.18 
follows. 


PROOF OF THEOREM 2.18. First observe that in the derivation of (2.5) one
could pretend that time flows from $m$ to 0 instead of from 0 to $m$ and reflect the
value $W(0) = 0$ to $W(0) = 2b$ instead of reflecting $W(m) = \xi$ to $W(m) =
2(b + \eta m) - \xi$. Also recall that in the derivation of (2.11) it was convenient to
work with the unconditional probability with the same drift as $P_\xi^{(m)}$.


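Let
$$P_{\lambda,\xi}^{(m)}(A) = P(A \mid S_0 = \lambda, S_m = \xi), \qquad A \in \mathcal{F}_m,$$
and put
$$\bar P_\xi^{(m)}(A) = \int_{-\infty}^{\infty} P_{\lambda,\xi}^{(m)}(A)\,\varphi[(\lambda - \xi)/m^{1/2}]\,d\lambda/m^{1/2}.$$

Note that if we regard the process as running backwards from an initial value
$S_m = \xi$ then under $\bar P_\xi^{(m)}$ it is normal random walk with zero drift.
Let $T^* = \sup\{n: n < m, |S_n| \ge bn^{1/2}\}$, so
$$P_\xi^{(m)}\{T < m\} = P_\xi^{(m)}\{T^* \ge m_0\}.$$
It is easy to see that the likelihood ratio of $x_{n+1}, \ldots, x_m$ under $P_{\lambda,\xi}^{(m)}$ relative to
$P_{0,\xi}^{(m)}$ (which is $P_\xi^{(m)}$, since $S_0 = 0$) is $l^{(m)}(m - n, S_n - \xi; \lambda - \xi, -\xi)$, where $l^{(m)}$ is
given by (2.3). This simplifies to
$$\exp[\lambda S_n/n - \lambda^2/2n - \lambda\xi/m + \lambda^2/2m].$$

Hence by a straightforward calculation one sees that the likelihood ratio of
$x_{n+1}, \ldots, x_m$ under $\bar P_\xi^{(m)}$ relative to $P_\xi^{(m)}$ is

$$\int_{-\infty}^{\infty}\exp[\lambda S_n/n - \lambda^2/2n - \lambda\xi/m + \lambda^2/2m]\,\varphi[(\lambda - \xi)/m^{1/2}]\,d\lambda/m^{1/2} = (n/m)^{1/2}\exp[S_n^2/2n - \xi^2/2m]. \tag{2.22}$$

Since $T^*$ is a stopping time for the process running backwards from time $m$ to
time 0, Wald's likelihood ratio identity yields the representation

$$\begin{aligned}
P_\xi^{(m)}\{T < m\} &= P_\xi^{(m)}\{T^* \ge m_0\} \\
&= \bar E_\xi^{(m)}\{(m/T^*)^{1/2}\exp[-S_{T^*}^2/2T^* + \xi^2/2m];\ T^* \ge m_0\} \\
&= \exp[-\tfrac12(b^2 - \xi^2/m)]\,\bar E_\xi^{(m)}\{(m/T^*)^{1/2}\exp[-\tfrac12(S_{T^*}^2/T^* - b^2)];\ T^* \ge m_0\},
\end{aligned} \tag{2.23}$$
where $\bar E_\xi^{(m)}$ denotes expectation with respect to $\bar P_\xi^{(m)}$.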
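
The Gaussian integral behind (2.22) can be checked by completing the square. The following worked step is a sketch of that calculation, using only the expression for the likelihood ratio displayed above and $\varphi(z) = (2\pi)^{-1/2}e^{-z^2/2}$; the terms in $\lambda\xi/m$ and $\lambda^2/2m$ cancel against the exponent of $\varphi$.

```latex
\begin{aligned}
&\int_{-\infty}^{\infty}
  \exp\!\Big[\frac{\lambda S_n}{n}-\frac{\lambda^2}{2n}-\frac{\lambda\xi}{m}+\frac{\lambda^2}{2m}\Big]
  (2\pi m)^{-1/2}\exp\!\Big[-\frac{(\lambda-\xi)^2}{2m}\Big]\,d\lambda \\
&\qquad= (2\pi m)^{-1/2}\,e^{-\xi^2/2m}
  \int_{-\infty}^{\infty}\exp\!\Big[\frac{\lambda S_n}{n}-\frac{\lambda^2}{2n}\Big]\,d\lambda \\
&\qquad= (2\pi m)^{-1/2}\,e^{-\xi^2/2m}\,(2\pi n)^{1/2}\,e^{S_n^2/2n}
 \;=\; (n/m)^{1/2}\exp\!\big[S_n^2/2n-\xi^2/2m\big].
\end{aligned}
```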




The inequality in Theorem 2.18 is an immediate consequence of (2.23). To
prove the asymptotic relation, it remains to evaluate asymptotically the expectation
in (2.23).

Observe that under the mixture measure $\bar P_\xi^{(m)}$ the joint distribution of $(T^*, S_{T^*})$
is the same as the $P_0$ joint distribution of $(m - \tau^*, \xi + S_{\tau^*})$, where

$$\tau^* = \inf\{n: |\xi + S_n| \ge b(m - n)^{1/2}\}.$$

Hence the expectation on the right-hand side of (2.23) equals

$$E_0\{(1 - \tau^*/m)^{-1/2}\exp[-\tfrac12((\xi + S_{\tau^*})^2/(m - \tau^*) - b^2)];\ \tau^* \le m - m_0\}. \tag{2.24}$$

An easy law of large numbers argument shows that as $m \to \infty$

$$m^{-1}\tau^* \to 1 - (\xi_0/\mu_1)^2 \tag{2.25}$$

in probability, and in particular

$$P_0\{\tau^* < m - m_0\} \to \begin{cases} 1 & \text{for } |\xi_0| > \mu_1^2/\mu_0, \\ 0 & \text{for } |\xi_0| < \mu_1^2/\mu_0. \end{cases} \tag{2.26}$$

If we were dealing with Brownian motion, for which there would be no excess
over the boundary, this would complete the argument. For the discrete time
process, after using (2.25) and (2.26) in (2.24), it suffices to show that

$$\lim E_0\exp\{-\tfrac12[(\xi + S_{\tau^*})^2/(m - \tau^*) - \mu_1^2 m]\} = \nu(\mu_1^2/\xi_0), \tag{2.27}$$

which requires a renewal theorem (cf. (2.8)) for nonlinear functions of a random
walk.
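
The law of large numbers limit (2.25) is easy to illustrate by simulation. The sketch below is an illustration only, with hypothetical names and arbitrarily chosen parameter values: it simulates a standard normal random walk, records the first $n$ with $|\xi + S_n| \ge b(m - n)^{1/2}$ for $b = \mu_1 m^{1/2}$ and $\xi = m\xi_0$, and compares the average of $\tau^*/m$ with $1 - (\xi_0/\mu_1)^2$.

```python
import math
import random


def tau_star(m, mu1, xi0, rng):
    """First n with |xi + S_n| >= b*sqrt(m - n), where b = mu1*sqrt(m), xi = m*xi0."""
    b = mu1 * math.sqrt(m)
    xi = m * xi0
    s = 0.0
    for n in range(1, m + 1):
        s += rng.gauss(0.0, 1.0)
        # At n = m the boundary is 0, so the loop always stops by then.
        if abs(xi + s) >= b * math.sqrt(m - n):
            return n
    return m


rng = random.Random(0)  # fixed seed, for reproducibility only
m, mu1, xi0 = 10_000, 1.0, 0.5
ratios = [tau_star(m, mu1, xi0, rng) / m for _ in range(20)]
mean_ratio = sum(ratios) / len(ratios)
print(mean_ratio, 1.0 - (xi0 / mu1) ** 2)
```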
To verify (2.27) observe that $\tau^*$ can be expressed

$$\tau^* = \inf\{n: n \ge 1,\ \tfrac12\mu_1^2\xi_0^{-1}n + S_n + \tfrac12 m^{-1}\xi_0^{-1}S_n^2 \ge \tfrac12\xi_0^{-1}m(\mu_1^2 - \xi_0^2)\}. \tag{2.28}$$

If the term involving $S_n^2$ did not appear in this expression, the renewal theorem
would give us the limiting distribution of the excess over the boundary. Because
of (2.25) it seems plausible that in the relatively small interval of time into which
$\tau^*$ falls with probability close to 1, the quadratic term $m^{-1}S_n^2$ is effectively
constant and hence has no effect on this limiting distribution. Lai and Siegmund
(1977) describe a general class of processes which can be decomposed into the
sum of a random walk and a term which varies sufficiently slowly that the
limiting distribution of excess over the boundary is determined by the random
walk alone via (2.8). See Appendix 2 for an informal discussion of nonlinear
renewal theory. The consequence is that

$$\lim_{m\to\infty} P_0\{\tfrac12\mu_1^2\xi_0^{-1}\tau^* + S_{\tau^*} + \tfrac12 m^{-1}\xi_0^{-1}S_{\tau^*}^2 - \tfrac12\xi_0^{-1}m(\mu_1^2 - \xi_0^2) \le x\} = H(x),$$

where $H(x)$ is the limiting distribution as given in (2.8) for a random walk $\tilde S_n$
having normally distributed increments with mean $\tfrac12\mu_1^2\xi_0^{-1}$ and variance 1. With
the aid of (2.25) it is easy to convert this limiting result to
$$\lim_{m\to\infty} P_0\{\tfrac12(m - \tau^*)^{-1}(m\xi_0 + S_{\tau^*})^2 - \tfrac12\mu_1^2 m \le x\} = H(x\xi_0/\mu_1^2).$$

A trivial change of variable yields (2.27) with the same function $\nu$ that appears
in (2.11) as the limit of (2.9).


The method described above generalizes in a straightforward fashion to
repeated significance tests in one-parameter exponential families. For the much
more difficult multiparameter case, see Woodroofe (1978) and Lalley (1983). Hu
(1985) shows that the present method leads to simplifications and new results in
the multiparameter case, especially when there is some invariance present.

In the case of Brownian motion the preceding argument can easily be sharpened
to yield a second-order term in an asymptotic expansion of $P_\xi^{(m)}\{T < m\}$
or $P_0\{T < m, |W(m)| < cm^{1/2}\}$. When there is no excess over the boundary the
only approximation involved in the preceding argument is that of replacing
$(1 - \tau^*/m)^{-1/2}$ by its limit as given by (2.25). To obtain the next order of
approximation, it is only necessary to expand $(1 - \tau^*/m)^{-1/2}$ in a Taylor series
and analyze its central limit behavior. Although simple in principle, the calculation
is quite complicated in detail because one must consider three cases: $\xi_0$ close
to the endpoints of the interval $(\mu_1^2/\mu_0, \mu_1)$ and $\xi_0$ in the interior of this interval.
Siegmund (1985b) shows under the conditions of Corollary 2.19, for $T$ defined by
(1.1) with Brownian motion $W(t)$ instead of $S_n$, for $\mu_1^2/\mu_0 < \gamma \le \mu_1$

$$P_0\{T < m, |W(m)| < cm^{1/2}\} = (b - b^{-1})\varphi(b)\log(mc^2/m_0b^2) + b^{-1}\varphi(b)[3 - (bc^{-1})^2] + o(b^{-1}\varphi(b)). \tag{2.29}$$

Miller and Siegmund (1982) discuss the history of the special case $c = b$ of
(2.29), which has been given incorrectly several times in the literature.

Using methods introduced by Woodroofe (1976b, 1982), Woodroofe and
Takahashi (1982) obtain the comparable approximation for $P_0\{T \le m\}$ in the
discrete case. The result is quite complicated and does not appear to yield
generally more accurate approximations than the one suggested here (i.e., the
sum of (2.20) with $c = b$ and $P_0\{|S_m| \ge bm^{1/2}\} = 2[1 - \Phi(b)]$).


3. Other boundary crossing problems. In Part 3 we consider a number of 
somewhat related (fixed sample) statistical problems which involve boundary 
crossing probabilities. For historical reasons the Kolmogorov-Smirnov and
Anderson-Darling statistics are discussed briefly in Section 3.1. Section 3.2 is
concerned with the mathematically similar but conceptually different problem of
maximum $\chi^2$ statistics. Sections 3.3-3.6 on change-point problems are the
primary focus of the chapter. (These sections can be read independently of the 
first two.) As we shall see, the methods of Part 2 occasionally deliver an 
appropriate approximation immediately, sometimes additional work is required, 
and sometimes completely new methods are needed. 




3.1. Kolmogorov-Smirnov and Anderson-Darling statistics. Let $u_1, u_2, \ldots$
be independent and uniform on [0,1], and let

$$F_n(x) = n^{-1}\sum_{i=1}^{n} I_{\{u_i \le x\}}$$

be the empirical distribution function. As stated in the introduction, essentially
the first boundary crossing problem in statistics is that of finding the distribution
of the one-sample Kolmogorov-Smirnov statistic,

$$\sup_x\,[x - F_n(x)].$$

The distribution can be evaluated exactly (e.g., Birnbaum and Tingey, 1951), but
the result is quite complicated. From the representation of the uniform order
statistics as $W_k/W_{n+1}$, $k = 1, 2, \ldots, n$, where $W_k = y_1 + \cdots + y_k$ with $y_1, y_2, \ldots$
independent standard exponential, it follows that
$$P\{\sup_x\,[x - F_n(x)] > \zeta\} = P\{\max_{1 \le j \le n}(W_j - j) > n\zeta - 1 \mid W_{n+1} - (n+1) = -1\}.$$
The methods of Part 2 yield a large deviation and a corrected diffusion
approximation, both of which are very accurate. See Siegmund (1982), Yuh (1982),
and Siegmund (1985a). Of course, the limiting distribution is given by (2.5) with
$\xi = 0$, $m = 1$, and $b = \zeta n^{1/2}$, but it is not a particularly good approximation for
small $n$.
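
For concreteness, the one-sided statistic and its Brownian bridge limit can be computed as follows. This is a minimal sketch with hypothetical function names; it uses the fact that the supremum of $x - F_n(x)$ is approached just below an order statistic, and it assumes that (2.5) with $\xi = 0$, $m = 1$, $b = \zeta n^{1/2}$ (and zero slope $\eta$) reduces to $\exp(-2n\zeta^2)$.

```python
import math


def one_sided_ks(sample):
    """sup_x [x - F_n(x)] for a sample from [0, 1], computed from the order statistics."""
    u = sorted(sample)
    n = len(u)
    # Just below the (j+1)-th smallest observation, F_n equals j/n.
    return max(u_j - j / n for j, u_j in enumerate(u))


def limiting_tail(zeta, n):
    """Limiting approximation to P{sup_x [x - F_n(x)] > zeta}: exp(-2 n zeta^2)."""
    return math.exp(-2.0 * n * zeta ** 2)


sample = [0.9, 0.2, 0.55, 0.7]
d = one_sided_ks(sample)
print(d, limiting_tail(d, len(sample)))
```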

Since the Kolmogorov-Smirnov statistic is insensitive to departures in the
tails from the hypothesized distribution, Anderson and Darling (1952) proposed
the goodness of fit statistic (two-sided alternative)

$$n^{1/2}\sup_{\varepsilon_1 \le x \le 1 - \varepsilon_2}|F_n(x) - x|/[x(1 - x)]^{1/2}, \qquad 0 < \varepsilon_1 < 1 - \varepsilon_2 < 1, \tag{3.1}$$

and observed that the asymptotic distribution of (3.1) as $n \to \infty$ is that of the
random variable
$$\max_{\varepsilon_1 \le t \le 1 - \varepsilon_2}|W_0(t)|/[t(1 - t)]^{1/2}, \tag{3.2}$$

where $W_0(t)$, $0 \le t \le 1$, is a Brownian bridge. It is immediately verified by
checking the covariance function that

$$W(t) = (1 + t)\,W_0(t/(1 + t)), \qquad 0 \le t < \infty,$$

is a standard (driftless) Brownian motion process, so

$$P\{\max_{\varepsilon_1 \le t \le 1 - \varepsilon_2}|W_0(t)|/[t(1 - t)]^{1/2} \ge b\} = P\{\max_{\varepsilon_1(1 - \varepsilon_1)^{-1} \le t \le \varepsilon_2^{-1}(1 - \varepsilon_2)} t^{-1/2}|W(t)| \ge b\}. \tag{3.3}$$
Hence the asymptotic significance level for the Anderson-Darling statistic
equals the significance level of a repeated significance test for the drift of
Brownian motion. In principle, one can compute (3.3) exactly (e.g., DeLong,
1981), but since the answer is very complicated and is only a crude approximation
to the probability of interest, a good and easily evaluated approximation
seems preferable. (Rather than work with the limiting distribution, i.e., that of
(3.2), one might attempt to give a large deviation approximation to the distribution
of (3.1). This seems a relatively straightforward problem; but a similar
analysis of the corresponding two-sample statistic, which appears in Section 3.2
in a disguise, appears quite difficult.)

Since for any $r > 0$, $r^{-1/2}W(rt)$, $0 \le t < \infty$, is again a standard Brownian
motion, it follows that

$$P\{\max_{u \le t \le v} t^{-1/2}|W(t)| \ge b\}$$

depends only on the ratio $vu^{-1}$, not the actual values of $u$ and $v$. Consequently
by (3.3) and (2.29) with $c = b$, as $b \to \infty$

$$\begin{aligned}
P\{|W_0(t)| \ge b[t(1 - t)]^{1/2} \text{ for some } \varepsilon_1 \le t \le 1 - \varepsilon_2\}
&= (b - b^{-1})\varphi(b)\log[(1 - \varepsilon_1)(1 - \varepsilon_2)/\varepsilon_1\varepsilon_2] \\
&\quad + 4b^{-1}\varphi(b) + o(b^{-1}\varphi(b)).
\end{aligned} \tag{3.4}$$

Comparison of (3.4) with the exact numerical computations of DeLong (1981)
shows that it is quite accurate, even when the probability is not close to 0.
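
The passage from (2.29) to (3.4) can be made concrete in code. The sketch below (hypothetical function names, $o(\cdot)$ terms dropped) evaluates both right-hand sides; with $c = b$ and $m/m_0 = (1-\varepsilon_1)(1-\varepsilon_2)/\varepsilon_1\varepsilon_2$, (3.4) equals (2.29) plus $2b^{-1}\varphi(b)$, which is the leading term of the endpoint probability $2[1 - \Phi(b)]$.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def eq_2_29(b, c, ratio):
    """Right-hand side of (2.29) without the o() term; ratio stands for m/m0."""
    return ((b - 1.0 / b) * phi(b) * math.log(ratio * c * c / (b * b))
            + phi(b) / b * (3.0 - (b / c) ** 2))


def eq_3_4(b, eps1, eps2):
    """Right-hand side of (3.4) without the o() term."""
    ratio = (1.0 - eps1) * (1.0 - eps2) / (eps1 * eps2)
    return (b - 1.0 / b) * phi(b) * math.log(ratio) + 4.0 * phi(b) / b


b, eps = 3.0, 0.1
print(eq_3_4(b, eps, eps))
```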


3.2. Maximum $\chi^2$ statistics. The random variable (3.2) arises as a limit in
distribution in a context which at first appears to be quite different from the
Anderson-Darling statistic.

Suppose that a 2 × 2 table is obtained from a categorical variable $A$ or $A^c$
(not $A$) and a dichotomized quantitative variable $Y$, which divides a population
according to low ($Y \le y$) and high ($Y > y$) values of $Y$. See Figure 5. The
situation might arise if $A$ ($A^c$) denotes the occurrence (nonoccurrence) of some
event or presence (absence) of some disease and $Y$ is a diagnostic predictor of the
event or disease. We seek a cut point $y^*$, which divides the population into low
risk and high risk groups.

[FIG. 5. The 2 × 2 table, with columns $Y \le y$ and $Y > y$.]

An apparently common ad hoc procedure for choosing $y^*$ is obtained by the
following reasoning. For a given value of $y$, one measure of dependence between
the categories $A$ and $Y \le y$ is the $\chi^2$ statistic

$$\chi_y^2 = \frac{N(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)},$$

and larger values of $\chi_y^2$ indicate a larger degree of dependence. Hence we choose
$y^*$ to maximize $\chi_y^2$ (subject to keeping some minimal percentage of the total
sample in the $Y \le y^*$ and $Y > y^*$ categories). To ascertain whether the observed
value of the maximum $\chi^2$ statistic might reasonably have arisen from chance
fluctuations, one would like to know the distribution of $\max_y \chi_y^2$ under the
hypothesis that $A$ and $\{Y \le y\}$ are independent for all $y$.

Let $F_1(y) = P\{Y \le y \mid A\}$, $F_2(y) = P\{Y \le y \mid A^c\}$. The natural nonparametric
estimators of $F_1$ and $F_2$ are

$$\hat F_1(y) = a/(a + b) \quad\text{and}\quad \hat F_2(y) = c/(c + d).$$

The hypothesis of independence in the 2 × 2 table for all $y$ is $H_0$: $F_1 = F_2$, and
$\hat F(y) = (a + c)/N$ estimates the common distribution function under $H_0$. In
terms of $\hat F_1$, $\hat F_2$, and $\hat F$, the square root of the maximum $\chi^2$ statistic is

$$\max_y \chi_y = \max_{\varepsilon_1 \le \hat F(y) \le 1 - \varepsilon_2} \frac{|\hat F_1(y) - \hat F_2(y)|}{\{(1/n_1 + 1/n_2)\,\hat F(y)[1 - \hat F(y)]\}^{1/2}},$$

where $n_1 = a + b$, $n_2 = c + d$. This is the natural definition of a two-sample
Anderson-Darling statistic, which under $H_0$ converges in law to (3.2) as
$\min(n_1, n_2) \to \infty$. See Miller and Siegmund (1982) for a more complete discussion
and numerical examples.

Although the probabilistic aspect of this problem is already solved, natural
and simple generalizations to deal with more than one predictor variable
dichotomized, say, by a hyperplane seem extremely difficult. See Halpern (1982)
for a more precise formulation and Monte Carlo study.
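
The equivalence between the $\chi^2$ statistic of the 2 × 2 table and the two-sample form in terms of $\hat F_1$, $\hat F_2$ and $\hat F$ can be checked numerically. The following is a minimal sketch with hypothetical function names; the data in the usage line are made up for illustration, and the truncation level `eps` must be positive so that $\hat F(y)[1 - \hat F(y)]$ never vanishes.

```python
import math


def chi2_2x2(a, b, c, d):
    """Chi-square statistic of the 2 x 2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def two_sample_form(a, b, c, d):
    """|F1_hat - F2_hat| / sqrt((1/n1 + 1/n2) * F_hat * (1 - F_hat))."""
    n1, n2 = a + b, c + d
    f1, f2, f = a / n1, c / n2, (a + c) / (n1 + n2)
    return abs(f1 - f2) / math.sqrt((1.0 / n1 + 1.0 / n2) * f * (1.0 - f))


def max_chi(y_a, y_ac, eps=0.1):
    """Maximize chi_y over observed cut points y with eps <= F_hat(y) <= 1 - eps."""
    n1, n2 = len(y_a), len(y_ac)
    best_chi, best_y = 0.0, None
    for y in sorted(set(y_a) | set(y_ac)):
        a = sum(v <= y for v in y_a)
        c = sum(v <= y for v in y_ac)
        f = (a + c) / (n1 + n2)
        if eps <= f <= 1.0 - eps:
            chi = two_sample_form(a, n1 - a, c, n2 - c)
            if chi > best_chi:
                best_chi, best_y = chi, y
    return best_chi, best_y


print(max_chi([1, 2, 3], [4, 5, 6]))
```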








3.3. Introduction to change-point problems. In these final four sections we
shall discuss detection and estimation of the time(s) of an abrupt change in the
distribution of a sequence of observations $x_1, x_2, \ldots$. To simplify the discussion,
assume that the $x_i$ are independent and normally distributed with means $\mu^{(i)}$
and variance 1. Change-point problems appear to have arisen originally in the
context of quality control, where one observes the output of a production process
sequentially and wants to signal any departure of the average output from some
known target value $\mu_0$. Outstanding contributions in a long line of papers on
sequential detection are Page (1954), Shiryayev (1963), Lorden (1970), and Pollak
(1985).

In the following we consider only fixed sample problems involving a finite
sequence $x_1, x_2, \ldots, x_m$. The specific problems to be discussed are to test the null
hypothesis of no change $H_0$: $\mu^{(1)} = \cdots = \mu^{(m)}$ against the alternatives of exactly
one change,

$$H_1: \exists\, 1 \le \rho < m \text{ such that } \mu^{(1)} = \cdots = \mu^{(\rho)} = \mu_0 \ne \mu_1 = \mu^{(\rho+1)} = \cdots = \mu^{(m)},$$

or against the epidemic or square wave alternative,
$$H_2: \exists\, 1 \le \rho_1 < \rho_2 < m \text{ such that } \mu^{(1)} = \cdots = \mu^{(\rho_1)} = \mu_0,$$
$$\mu^{(\rho_1+1)} = \cdots = \mu^{(\rho_2)} = \mu_0 + \delta, \qquad \mu^{(\rho_2+1)} = \cdots = \mu^{(m)} = \mu_0.$$

We also consider estimation of $\rho$ by a confidence set when the hypothesis of
exactly one change is assumed to be true. Typically $\mu_0$ and $\delta = \mu_1 - \mu_0$ are
unknown, but it sometimes seems reasonable to suppose that a particular value
of $\delta$ is a minimum threshold of interest and hence to regard $\delta$ as known for the
purpose of deriving a test statistic.

Examples of change-point problems in epidemiology are described by Worsley 
(1983) and by Levin and Kline (1984). Here one is interested in testing whether 
the incidence of a disease has remained constant over time, and if not, in 
estimating the time(s) of change(s) in order to suggest possible causes. Kendall 
and Kendall (1980) describe an interesting change-point problem in archaeology, 
and Brown, Durbin, and Evans (1975) give a number of econometric examples. 

In Section 3.4 we consider the likelihood ratio test of no change against the 
alternative of exactly one change. A large number of test statistics have been 
proposed for this problem, and there is no attempt to compare them here. The 
main conclusion is that the methods of Part 2 provide the basic tools to study a 
number of these tests without resorting to the numerical or Monte Carlo efforts 
that have been the basis of earlier studies (e.g., Sen and Srivastava, 1975). 

Section 3.5 is concerned with finding a confidence set for $\rho$. In the case where
$\mu_0$ and $\mu_1$ are both known, we compare confidence intervals based on the
maximum likelihood estimator $\hat\rho$, confidence sets (which generally are not intervals)
based directly on the likelihood function, and a third method, also derived
from the likelihood function. Hinkley (1970, 1972) has mentioned the first two
methods, but he directs his efforts primarily at computational problems and does
not compare the methods quantitatively. To minimize the computational
difficulties and facilitate a simple comparison, we consider the case of Brownian
motion. The likelihood based method is extended to the case of unknown
nuisance parameters, $\mu_0$ and $\mu_1$.

Section 3.6 is concerned with testing the hypothesis of no change against an
epidemic alternative. Here one encounters processes with a multidimensional
indexing set, which introduce some new problems. The new methods of Part 2 can
be used in some special cases, but in others an adaptation of ideas of Bickel and
Rosenblatt (1973) or Qualls and Watanabe (1973) seems more fruitful.


3.4. Tests against the alternative of exactly one change. The problem of
testing the null hypothesis of no change, $H_0$: $\mu^{(1)} = \cdots = \mu^{(m)} = \mu_0$, against the
alternative of exactly one change, $H_1$: $\exists\, 1 \le \rho < m$ such that $\mu^{(1)} = \cdots = \mu^{(\rho)} =
\mu_0 \ne \mu_1 = \mu^{(\rho+1)} = \cdots = \mu^{(m)}$ ($\mu_0$ and $\mu_1$ both unknown) has been widely
discussed; a number of test statistics have been proposed. The quasi-Bayesian
statistics of Chernoff and Zacks (1964) and Gardner (1969) are analytically
tractable, but maximum likelihood type statistics have typically been studied by
numerical or Monte Carlo methods (e.g., Sen and Srivastava, 1975, Worsley,
1983).




The square root of the log likelihood ratio statistic is proportional to
$$\max_{1 \le k < m}\{|S_k - kS_m/m|/[k(1 - k/m)]^{1/2}\}. \tag{3.5}$$
A simple heuristic derivation of this statistic with minimal calculation is to
suppose momentarily that $H_1$ specifies $\rho = k$. The problem then becomes a two
population test to decide whether the mean ($\mu_0$) of the first $k$ observations
equals the mean ($\mu_1$) of the last $m - k$. The standard test statistic is the
normalized difference between the mean of the first $k$ observations, $S_k/k$, and
the overall mean, $S_m/m$. This is just (3.5) without the max, which accounts for
the fact that $\rho$ is actually unknown.

Slightly more generally, we shall consider
$$\max_{m_0 \le k \le m_1}\{|S_k - kS_m/m|/[k(1 - k/m)]^{1/2}\}, \tag{3.6}$$
where $1 \le m_0 < m_1 < m$. (A justification is given below.)

To obtain some intuition for the virtues and defects of (3.6) consider also the
ad hoc suggestion of Pettitt (1980)

$$\max_{1 \le k < m}|S_k - kS_m/m|. \tag{3.7}$$
Under $H_0$ the process $S_k - kS_m/m$, $k = 0, 1, \ldots, m$ is the same as the conditional
process $S_k$, $k = 0, 1, \ldots, m$ given that $S_m = 0$, i.e., the same as a Brownian
bridge observed at discrete instants of time. Hence an excellent approximation to
the significance level of (3.7) can be obtained from (2.11) or (2.16) (multiplied by
2 to account for the two-sided alternative); the significance level of (3.6) is
discussed below.
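
Both statistics are simple functions of the partial sums. The sketch below (hypothetical function name; the data are made up) computes the normalized statistic (3.6), which reduces to (3.5) for $m_0 = 1$, $m_1 = m - 1$, together with the unnormalized statistic (3.7).

```python
import math
from itertools import accumulate


def change_point_stats(x, m0=1, m1=None):
    """Return (statistic (3.6), statistic (3.7)) for the observations x_1, ..., x_m."""
    m = len(x)
    m1 = m - 1 if m1 is None else m1
    s = [0.0] + list(accumulate(x))  # s[k] = S_k, with S_0 = 0
    dev = {k: abs(s[k] - k * s[m] / m) for k in range(1, m)}
    normalized = max(dev[k] / math.sqrt(k * (1.0 - k / m)) for k in range(m0, m1 + 1))
    unnormalized = max(dev.values())
    return normalized, unnormalized


# A sequence whose mean shifts from 0 to 1 after the third observation.
print(change_point_stats([0, 0, 0, 1, 1, 1]))
```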

Under $H_1$ the drift of $S_k - kS_m/m$, $k = 0, 1, \ldots, m$, is

$$\begin{cases} k(1 - \rho/m)(\mu_0 - \mu_1), & k \le \rho, \\ (\rho/m)(m - k)(\mu_0 - \mu_1), & k \ge \rho, \end{cases} \tag{3.8}$$

and the residual process after subtracting out the drift is again a Brownian
bridge observed at discrete instants of time.

It seems intuitively clear from (3.8) as illustrated in Figure 6 that (3.7) is more
powerful than (3.5) for detecting changes that occur near $m/2$, whereas the
converse is true for changes occurring near the endpoints 0 and $m$.

It is intrinsically difficult to detect a change that occurs near one or the other
endpoint, and the likelihood ratio statistic pays for its efforts to do so by giving
up power near $\rho = m/2$. The introduction of $m_0$ and $m_1$ in (3.6) gives the
statistician the flexibility to give up some power to detect changes occurring near
the endpoints in return for an increase in power near $m/2$.

By conditioning on $S_m$, one can obtain approximations to the power of (3.6)
and (3.7), which can be used to compare these statistics with each other and with
other proposals (e.g., the recursive residual test of Brown, Durbin, and Evans,
1975). A more complete discussion will appear in a future publication with B.
and K. James. To illustrate the applicability of the methods of Part 2, and to
prepare for the discussion of confidence sets in Section 3.5, an approximation to
the significance level of (3.6) is given below.


[FIG. 6. Rejection regions for $H_0$ under the statistics (3.6) (with $m_0 = 1$, $m_1 = m - 1$) and (3.7).]


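Let $x_1, x_2, \ldots, x_m$ be independent standard normal random variables, and put
$S_n = x_1 + \cdots + x_n$ ($n = 1, 2, \ldots, m$). We continue to use the notation

$$P_\xi^{(m)}(A) = P(A \mid S_m = \xi), \qquad A \in \mathcal{F}(x_1, \ldots, x_m).$$
Let $b > 0$, $m = 2, 3, \ldots$, $1 \le m_0 < m$, and define
$$T = \inf\{n: n \ge m_0, |S_n| \ge b[n(1 - n/m)]^{1/2}\}. \tag{3.9}$$
Let $m_0 < m_1 \le m - 1$. The significance level of the test defined by (3.6) is
$$P_0^{(m)}\{T \le m_1\} = P_0^{(m)}\{|S_{m_1}| \ge b[m_1(1 - m_1/m)]^{1/2}\} + \int_{|\xi| < b[m_1(1 - m_1/m)]^{1/2}} P_\xi^{(m_1)}\{T < m_1\}\,P_0^{(m)}\{S_{m_1} \in d\xi\}. \tag{3.10}$$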


THEOREM 3.11. Assume that $b \to \infty$, $m_0 \to \infty$, $m_1 \to \infty$, $m \to \infty$ in such a
way that for some $0 < t_0 < t_1 < 1$ and $\mu_1 > 0$
$$m_i/m \to t_i, \quad i = 0, 1, \qquad\text{and}\qquad b/m^{1/2} = \mu_1.$$
Let $\xi = m\xi_0$ for some $|\xi_0| \in (\mu_1(1 - t_1)[t_0/(1 - t_0)]^{1/2},\ \mu_1[t_1(1 - t_1)]^{1/2})$.
Then as $m \to \infty$,

$$P_\xi^{(m_1)}\{T < m_1\} \sim \frac{\mu_1[t_1(1 - t_1)]^{1/2}}{|\xi_0|}\,\nu\!\left[\frac{\mu_1^2(1 - t_1)}{|\xi_0|} + \frac{|\xi_0|}{1 - t_1}\right]\exp\{-\tfrac12 m[\mu_1^2 - \xi_0^2/t_1(1 - t_1)]\},$$

where $\nu$ is given by (2.12).

REMARKS. Substitution of this asymptotic expression into (3.10) suggests
the approximation

$$P_0^{(m)}\{T \le m_1\} \approx 2b\varphi(b)\int_{b(m_1^{-1} - m^{-1})^{1/2}}^{b(m_0^{-1} - m^{-1})^{1/2}} x^{-1}\nu(x + b^2/mx)\,dx + 2[1 - \Phi(b)], \tag{3.12}$$

which can be shown to be a valid asymptotic relation (even if $m_0$ and $m - m_1$
are $o(m)$). A proof of Theorem 3.11 along the lines of Theorem 2.18 has an
interesting twist, leading to some new technical problems. An informal discussion
is contained in Appendix 1. Siegmund (1985b, XI) derives (3.12) directly without
first obtaining Theorem 3.11. However, we shall find Theorem 3.11 to be of
interest in its own right in the next section.

TABLE 5
$P_0^{(m)}\{T \le m_1\}$

   b     m_0   m_1    m    Probability approximation (3.12)   Exact or Monte Carlo
  2.91    1     3     4              0.010                     0.01
  2.65    1     9    10              0.052                     0.05
  2.38    1     9    10              0.105                     0.10
  2.38    4     8    10              0.057                     0.058 ± 0.002
  2.50    1    19    20              0.120                     0.110 ± 0.002
  2.50    1    10    20              0.066                     0.063 ± 0.001
  2.50    4    16    20              0.073                     0.074 ± 0.002

Exact values from Worsley (1983).


Table 5 gives an indication of the accuracy of (3.12). For comparison an exact 
numerical calculation from Worsley (1983) or the result of Monte Carlo experi- 
ment plus or minus one standard error is also given. There were 2500 repetitions 
of the Monte Carlo experiment, and importance sampling along the lines dis- 
cussed in Siegmund (1976) was used for variance reduction. 
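
The approximation (3.12) is a one-dimensional integral and is easily evaluated numerically. The sketch below is not the computation behind Table 5: it assumes $\nu(x) \approx \exp(-0.583x)$ in place of (2.12) (an assumed form of the approximation (2.13), which lies outside this passage), reads the integration limits as $b(m_1^{-1} - m^{-1})^{1/2}$ and $b(m_0^{-1} - m^{-1})^{1/2}$, and uses Simpson's rule; all function names are hypothetical.

```python
import math

RHO = 0.583  # assumed constant in the approximation nu(x) ~ exp(-RHO * x)


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def nu(x):
    """Assumed stand-in for the function nu of (2.12)."""
    return math.exp(-RHO * x)


def approx_3_12(b, m0, m1, m, steps=2000):
    """Approximation (3.12) to the significance level; steps must be even (Simpson's rule)."""
    lo = b * math.sqrt(1.0 / m1 - 1.0 / m)
    hi = b * math.sqrt(1.0 / m0 - 1.0 / m)

    def integrand(x):
        return nu(x + b * b / (m * x)) / x

    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(lo + i * h)
    return 2.0 * b * phi(b) * (total * h / 3.0) + 2.0 * (1.0 - Phi(b))


for row in [(2.91, 1, 3, 4), (2.65, 1, 9, 10)]:
    print(row, round(approx_3_12(*row), 4))
```

With these assumptions the first two rows of Table 5 are reproduced to within rounding of the assumed form of $\nu$.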


3.5. Confidence sets for $\rho$. This section is concerned with finding confidence
sets for $\rho$, when $\mu_0$ and $\mu_1$ are regarded as nuisance parameters. Initially we shall
assume that $\mu_0$ and $\mu_1$ are both known, a case studied in considerable detail by
Hinkley (1970, 1972), who suggested a method based on the maximum likelihood
estimator $\hat\rho$ and a second method based directly on the likelihood function. In
order to simplify the computational difficulties as much as possible and obtain a
picture of the relative merits of these two proposals, we begin with the case of
Brownian motion observed for $0 \le t \le m$. As the results of Part 2 indicate, use of
Brownian motion as an approximation usually yields quantitatively poor results
for boundary crossing probabilities. For comparing competing procedures, however,
Brownian motion can be quite useful.

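Hence let $W(t)$, $0 \le t \le m$, be standard Brownian motion, and assume that
the observed process $X(t)$, $0 \le t \le m$, satisfies

$$dX(t) = \begin{cases} \mu_0\,dt + dW(t) & \text{for } 0 \le t \le \rho, \\ \mu_1\,dt + dW(t) & \text{for } \rho < t \le m, \end{cases}$$

where $\mu_0$ and $\mu_1$ are both known and $\rho$ is unknown. There is no loss of generality
in taking $\mu_0 = 0$. Put $\delta = \mu_1$. The likelihood function at $\rho = t$ is proportional to
$\exp[\tfrac12\delta^2 t - \delta X(t)]$. Hence the log likelihood $l(t) = \tfrac12\delta^2 t - \delta X(t)$ satisfies
$$dl(t) = \begin{cases} \tfrac12\delta^2\,dt - \delta\,dW(t), & 0 \le t \le \rho, \\ -\tfrac12\delta^2\,dt - \delta\,dW(t), & \rho < t \le m, \end{cases}$$
i.e., $l(t)$ is Brownian motion with drift $\tfrac12\delta^2$ or $-\tfrac12\delta^2$ in accord with $t < \rho$ or
$t > \rho$, and $\hat\rho$ is the time at which this process takes on its maximum value.

[FIG. 7. The path $l(s) - l(\rho)$, with the interval $(t, t + dt)$ marked.]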


It is easy to compute the distribution of $\hat\rho$, but to simplify the resulting
expression we assume that $\rho$ and $m - \rho$ are effectively infinitely large. Consider

$$P_\rho\{\hat\rho - \rho \in (t, t + dt),\ l(\hat\rho) - l(\rho) \in (x, x + dx)\}.$$

This joint density can be evaluated by (i) conditioning on $l(t) - l(\rho) =
x - y(dt)^{1/2}$, $l(t + dt) - l(\rho) = x - z(dt)^{1/2}$; (ii) computing the (conditional)
probability that the process $l(s) - l(\rho)$ does not attain the value $x$ for $s < t$ nor
for $s > t + dt$ and its maximum in the interval $(t, t + dt)$ is in $(x, x + dx)$; and
(iii) integrating out $y$ and $z$ over $(0, \infty)$. See Figure 7. The joint density is
$\delta^{-1}|t|^{-3/2}x[1 - \exp(-x)]\varphi(x/\delta|t|^{1/2} + \tfrac12\delta|t|^{1/2})\,dx\,dt$ for $x > 0$ and 0 otherwise.
Integration over $x \in (0, \infty)$ and $t \in (r, \infty)$ ($r > 0$) yields
$$P_\rho\{\hat\rho - \rho > r\} = \Phi(-\tfrac12\delta r^{1/2})(\tfrac12\delta^2 r + \tfrac52) - \delta r^{1/2}\varphi(\tfrac12\delta r^{1/2}) - (\tfrac32)\exp(\delta^2 r)\Phi(-3\delta r^{1/2}/2). \tag{3.13}$$
It follows from (3.13) that the length of a 95% confidence interval for $\rho$ obtained by
treating $\hat\rho - \rho$ as a pivotal quantity is about $22/\delta^2$. (Without the assumption
that $\rho$ and $m - \rho$ are infinitely large, $\hat\rho - \rho$ would not be an exact pivotal.)

The definition of a likelihood based confidence set is very simple. For $x > 0$ let
$A(\rho) = \{\sup_s[l(s) - l(\rho)] < x\}$. Choose $x$ so that $P_\rho\{A(\rho)\} = 1 - \alpha$ and define
the confidence set to be those values $t$ for which $l(t) > l(\hat\rho) - x$. Again assuming
that $\rho$ and $m - \rho$ are effectively infinite, one easily sees that (cf. Figure 7)

$$P_\rho\{A(\rho)\} = [1 - \exp(-x)]^2,$$

so
$$x = -\log[1 - (1 - \alpha)^{1/2}]. \tag{3.14}$$



Observe that the confidence set obtained in this manner is by no means an
interval. In fact, because of the rapid fluctuations of Brownian sample paths,
with probability 1 it consists of the union of infinitely many open intervals.

To compare this likelihood based confidence set with the confidence interval
determined above, we compute the expected size of the confidence set, i.e.,

$$E_\rho[\lambda\{t: \omega \in A(t)\}] = \int_{-\infty}^{\infty} P_\rho[A(t)]\,dt, \tag{3.15}$$

where $\lambda$ denotes Lebesgue measure and $\omega$ is a sample path $l(s) - l(\rho)$, $-\infty <
s < \infty$. By conditioning on $l(t)$ one can derive an expression for $P_\rho[A(t)]$, which
when integrated shows that (3.15) equals


46~'{1 — exp(—x)] (x8! ~ (28) [1 — exp(—x)] } 
= 487?(1 ~ a)” { —log|1 - (1 - a)'?] - G =- a)'®}. 


For a 95% confidence set the expected size is about $10.5/\delta^2$, or less than one-half
the length of the corresponding confidence interval based on the distribution of
$\hat\rho$. Numerical evidence indicates that the likelihood based set has approximately
the same expected size advantage throughout the range of commonly used
confidence levels.

One can turn the likelihood based confidence set into a confidence interval by
filling in the holes. More precisely let $x > 0$ and define

$$R\,(L) = \sup\,(\inf)\{t: l(t) \ge \sup_s l(s) - x\}.$$

If $x$ is chosen to satisfy (3.14), then $[L, R]$ is the smallest interval containing the
likelihood based confidence set and hence is a confidence interval of confidence
$\ge 1 - \alpha$. Under the assumption that $\rho$ and $m - \rho$ are effectively infinite, it is
easy to show that in fact

$$P_\rho\{\rho \in [L, R]\} = 1 - 2P_\rho\{R < \rho\} = 1 - e^{-x},$$

so for $x = \log\alpha^{-1}$, $[L, R]$ is a confidence interval of confidence exactly $1 - \alpha$.

An evaluation of the expected length of this interval, $E_\rho(R - L)$, is not
difficult in principle, but is extremely complicated in detail. I believe that as
$x \to \infty$, hence $\alpha = e^{-x} \to 0$

$$E_\rho(R - L) = \delta^{-2}(4\log\alpha^{-1} + 2 - 3\alpha/4 + o(\alpha)).$$

Hence for a 95% interval the expected length is about $14/\delta^2$, which is somewhat
larger than the expected measure of the likelihood based confidence set and
substantially smaller than the length of the interval defined by $\hat\rho$ as a pivotal
quantity.
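
The two thresholds and the conjectured expansion for $E_\rho(R - L)$ are simple enough to tabulate directly. The following sketch (hypothetical function names, $o(\alpha)$ term dropped) evaluates (3.14), the threshold $x = \log\alpha^{-1}$ of the filled-in interval, and the expansion at the 95% level.

```python
import math


def x_likelihood_set(alpha):
    """Threshold (3.14): solves [1 - exp(-x)]^2 = 1 - alpha."""
    return -math.log(1.0 - math.sqrt(1.0 - alpha))


def x_interval(alpha):
    """Threshold x = log(1/alpha), for which [L, R] has confidence exactly 1 - alpha."""
    return math.log(1.0 / alpha)


def expected_length_interval(alpha, delta=1.0):
    """Conjectured expansion of E(R - L), without the o(alpha) term."""
    return (4.0 * math.log(1.0 / alpha) + 2.0 - 0.75 * alpha) / delta ** 2


alpha = 0.05
print(x_likelihood_set(alpha), x_interval(alpha), expected_length_interval(alpha))
```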


REMARK. Cobb (1978) has proposed yet a fourth confidence set (interval) for
$\rho$. Given a suitable $t_0 > 0$, Cobb treats $l(\hat\rho) - l(t)$, $|t - \hat\rho| \le t_0$, as ancillary and
bases his interval on the conditional distribution of $\hat\rho - \rho$ given this ancillary
statistic. Thus, if the likelihood function drops off sharply from its maximum at
$\hat\rho$, Cobb's interval is short, a property shared by the likelihood based confidence
set and the related confidence interval. There is some arbitrariness in the choice
of $t_0$ which seems a definite disadvantage if one tries to adapt this method to
the case of unknown $\mu_0$ and $\mu_1$, especially if $m$ is of moderate size. Nonetheless,
it would be interesting to compare Cobb's method with those described above.


Now suppose that our observations are $x_1, \ldots, x_m$ as in Section 3.4, that the
hypothesis of exactly one change is true, and that $\mu_0$ and $\mu_1$ are unknown
nuisance parameters. A likelihood based confidence set for $\rho$ can be defined as
follows. The log likelihood ratio statistic for testing the hypothesis $\rho = \rho_0$
against the alternative of arbitrary $\rho$ is (cf. (3.6))

$$\max_k \frac{(S_k - kS_m/m)^2}{k(1 - k/m)} - \frac{(S_{\rho_0} - \rho_0 S_m/m)^2}{\rho_0(1 - \rho_0/m)}.$$

Hence for $1 \le m_0 < m_1 < m$ and $c > 0$ define the events

$$A(\rho, c) = \Bigl\{ \max_{m_0 \le k \le m_1} \frac{(S_k - kS_m/m)^2}{k(1 - k/m)} - \frac{(S_\rho - \rho S_m/m)^2}{\rho(1 - \rho/m)} < c^2 \Bigr\}.$$


Although the unconditional probability of $A(\rho, c)$ depends on both $\rho$ and
$\delta = \mu_1 - \mu_0$, its conditional probability given that $S_\rho - \rho S_m/m = \xi$, say, does
not depend on $\delta$. (Conditionally, $S_k - kS_m/m - \xi k/\rho$, $k = 0, 1, \ldots, \rho$, and
$S_k - kS_m/m - \xi(m - k)/(m - \rho)$, $k = \rho, \rho + 1, \ldots, m$, are two stochastically
independent Brownian bridges in discrete time.) Hence one can in principle determine
$c = c(\alpha, \rho, \xi)$ such that

$$P_\rho\{A(\rho, c) \mid S_\rho - \rho S_m/m = \xi\} = 1 - \alpha. \tag{3.16}$$

From (3.16) it follows immediately that the set of all $\rho$ such that the sample
path $\omega = \{S_k - kS_m/m,\ k = 0, 1, \ldots, m\}$ belongs to

$$A[\rho, c(\alpha, \rho, S_\rho - \rho S_m/m)]$$

is a $(1 - \alpha)100\%$ confidence set for $\rho$.

To implement this procedure one must compute the conditional probability in 
(3.16); but this problem is already solved (asymptotically) in Theorem 3.11, as 
follows. 

In terms of the stopping time $T$ defined in (3.9),

$$P_\rho\{A^c(\rho, c) \mid S_\rho - \rho S_m/m = \xi\} = P_\xi^{(\rho)}\{T < \rho\} + P_\xi^{(m-\rho)}\{T < m - \rho\} - P_\xi^{(\rho)}\{T < \rho\}\, P_\xi^{(m-\rho)}\{T < m - \rho\}, \tag{3.17}$$

where $b = [c^2 + \xi^2/\rho(1 - \rho/m)]^{1/2}$. If in addition to the asymptotic
normalization of Theorem 3.11 one assumes that $c^2$ is proportional to $m$ and $\rho/m$
equals some constant in (0,1), then Theorem 3.11 and (3.17) yield


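$$P_\rho\{A^c(\rho, c) \mid S_\rho - \rho S_m/m = \xi\} \sim \exp(-\tfrac12 c^2)\,[1 + \rho(1 - \rho/m)c^2/\xi^2]^{1/2} \times \bigl\{\nu[c^2(1 - \rho/m)/\xi + \xi/\rho(1 - \rho/m)] + \nu[c^2\rho/m\xi + \xi/\rho(1 - \rho/m)]\bigr\}, \tag{3.18}$$

where $\nu$ is defined in (2.12) and evaluated approximately in (2.13)-(2.14).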
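The approximation (3.18) is easy to evaluate numerically. The sketch below rests on several assumptions that are not stated in this section: it reads the leading factor as $\exp(-c^2/2)$, it replaces $\nu$ by the small-argument approximation $\nu(x) \approx \exp(-0.583x)$ familiar from Siegmund's work on corrected diffusion approximations (presumably what (2.13) provides), and it takes $\xi = m\xi_0$ when comparing with Table 6. The function names are hypothetical.

```python
import math


def nu_small(x: float) -> float:
    """Assumed small-argument approximation nu(x) ~ exp(-0.583 x)."""
    return math.exp(-0.583 * x)


def approx_3_18(m: int, c: float, rho: int, xi: float) -> float:
    """Right-hand side of (3.18), as read in the lead-in above.

    m   : number of observations
    c   : threshold in the definition of A(rho, c)
    rho : hypothesized change-point, 0 < rho < m
    xi  : conditioning value of S_rho - rho * S_m / m (must be positive here)
    """
    if not 0 < rho < m:
        raise ValueError("rho must satisfy 0 < rho < m")
    if xi <= 0:
        raise ValueError("this sketch assumes xi > 0")
    v = rho * (1.0 - rho / m)                      # rho (1 - rho/m)
    lead = math.exp(-0.5 * c * c) * math.sqrt(1.0 + v * c * c / xi**2)
    arg_left = c * c * (1.0 - rho / m) / xi + xi / v
    arg_right = c * c * rho / (m * xi) + xi / v
    return lead * (nu_small(arg_left) + nu_small(arg_right))


if __name__ == "__main__":
    # Rows of Table 6 with xi = m * xi_0 (an assumption about the table header).
    for m, c, rho, xi0 in [(40, 2.5, 20, 0.50), (20, 2.4, 10, 0.25), (20, 2.4, 6, 0.25)]:
        print(m, c, rho, xi0, round(approx_3_18(m, c, rho, m * xi0), 3))
```

Under these assumptions the three rows used above reproduce the tabulated approximations to about three decimals, which lends some support to this reading of the formula.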


TABLE 6
$P_\rho\{A^c(\rho, c) \mid S_\rho - \rho S_m/m = \xi\}$

 m    c     ρ    ξ_0 = m^{-1}ξ   Probability approximation (3.18)   Monte Carlo estimate
 40   2.5   20   0.50            0.027                              0.029 ± 0.001
 40   2.4   2    0.25            0.059                              0.058 ± 0.001
 20   2.4   10   0.25            0.066                              0.059 ± 0.001
 20   2.4   6    0.25            0.057                              0.052 ± 0.001
 20   2.2   6    0.50            0.043                              0.042 ± 0.002


Table 6 indicates the accuracy of (3.18). To obtain a Monte Carlo estimate of
the desired probability, importance sampling (Siegmund, 1976) was used to
obtain independent estimates of the two probabilities on the right hand side of
(3.17). The standard error of the overall estimate was obtained via the obvious
Taylor series expansion. The number of repetitions in each Monte Carlo experiment
was 900.
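The "obvious Taylor series expansion" is presumably the delta method applied to $f(p_1, p_2) = p_1 + p_2 - p_1 p_2$, the form of the right hand side of (3.17). A minimal sketch under that assumption, with a hypothetical function name and illustrative inputs (not the values behind Table 6):

```python
import math


def combine_estimates(p1: float, se1: float, p2: float, se2: float):
    """Combine two independent estimates into p1 + p2 - p1 * p2.

    First-order Taylor expansion: the partial derivatives are (1 - p2)
    and (1 - p1), so for independent estimates
        var(f) ~ (1 - p2)^2 var(p1) + (1 - p1)^2 var(p2).
    """
    estimate = p1 + p2 - p1 * p2
    std_err = math.sqrt(((1.0 - p2) * se1) ** 2 + ((1.0 - p1) * se2) ** 2)
    return estimate, std_err


if __name__ == "__main__":
    # Illustrative inputs only.
    print(combine_estimates(0.015, 0.0005, 0.015, 0.0005))
```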


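3.6. Tests against the epidemic alternative. In this section we consider several
tests of the hypothesis of no change $H_0: \mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(m)} = \mu_0$, against
the epidemic or square wave alternative, $H_1: \exists\, 1 \le \rho_1 < \rho_2 \le m$ such that
$\mu^{(1)} = \cdots = \mu^{(\rho_1)} = \mu_0$, $\mu^{(\rho_1+1)} = \cdots = \mu^{(\rho_2)} = \mu_0 + \delta$, $\mu^{(\rho_2+1)} = \cdots = \mu^{(m)} = \mu_0$.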


Results for this problem are incomplete, and our goals will be (i) to show that 
these tests naturally involve new boundary crossing problems and (ii) to suggest 
possible approaches to their solution. The problems are different from those 
discussed earlier in this paper, and the methods of Part 2 seem of limited 
usefulness. A promising alternative approach is provided by the method of 
Pickands (1969) as developed independently by Bickel and Rosenblatt (1973) and 
Qualls and Watanabe (1973). The results presented here are joint work with M. 
Hogan, which are described in greater detail in Hogan and Siegmund (1986). 

Typically $\mu_0$ and $\delta$ are unknown nuisance parameters, although often only
one-sided alternatives with $\delta > 0$, say, are of interest. We shall assume that there
is some threshold change, $\delta_0$, which one is interested in detecting and consider
tests for the particular alternative $\delta = \delta_0$. Thus in effect we assume that $\delta$ is
known for the purpose of deriving a test statistic, although a complete evaluation
of that statistic would involve all values of $\delta$, not just the hypothesized
value $\delta_0$.

In contrast to the alternative of exactly one change, the epidemic alternative
has rarely been considered. See Levin and Kline (1984) and Bhattacharya and
Brockwell (1976) for two quite different discussions.

Assume that $\delta = \delta_0$ is known. In the unlikely case that $\mu_0$ is also known the
log likelihood ratio statistic for testing $H_0$ against $H_1$ is proportional to

$$Z_1 = \max_{0 \le i < j \le m} \bigl[S_j - j\mu_0 - (S_i - i\mu_0) - \tfrac12\delta_0(j - i)\bigr] = \max_{0 \le j \le m} \Bigl[S_j - j\mu_0 - j\delta_0/2 - \min_{0 \le i \le j}(S_i - i\mu_0 - i\delta_0/2)\Bigr]. \tag{3.19}$$

For the case of unknown $\mu_0$ Levin and Kline (1984) suggest the use of (3.19) with
$\mu_0$ replaced by its maximum likelihood estimate under $H_0$, namely $\hat\mu_0 = m^{-1}S_m$,
to obtain

$$Z_2 = \max_{0 \le i < j \le m} \bigl[S_j - jS_m/m - (S_i - iS_m/m) - (j - i)\delta_0/2\bigr]. \tag{3.20}$$

The actual log likelihood ratio statistic in the case of known $\delta = \delta_0$ and unknown
$\mu_0$ is easily calculated to be

$$Z_3 = \max_{0 \le i < j \le m} \bigl\{S_j - S_i - (j - i)S_m/m - \tfrac12\delta_0(j - i)[1 - (j - i)/m]\bigr\}. \tag{3.21}$$
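For small $m$ the three statistics can be computed by direct enumeration of all pairs $0 \le i < j \le m$. The following sketch assumes the readings $Z_1$, $Z_2$, $Z_3$ of (3.19)-(3.21) with penalty $\tfrac12\delta_0(j - i)$; the function name and the sample data are hypothetical. It also evaluates the second, running-minimum form of (3.19), which agrees with the first whenever the maximum is nonnegative.

```python
import numpy as np


def epidemic_statistics(x, delta0: float, mu0: float = 0.0):
    """Brute-force evaluation of Z1 (3.19), Z2 (3.20) and Z3 (3.21).

    x      : observations x_1, ..., x_m
    delta0 : hypothesized size of the epidemic shift
    mu0    : baseline mean, used only by Z1 (the known-mean case)
    Returns (z1, z2, z3, z1_running_min_form).
    """
    x = np.asarray(x, dtype=float)
    m = x.size
    s = np.concatenate(([0.0], np.cumsum(x)))      # S_0, ..., S_m
    i, j = np.triu_indices(m + 1, k=1)             # all pairs 0 <= i < j <= m
    d = j - i                                      # epidemic duration j - i
    inc = s[j] - s[i]
    z1 = np.max(inc - d * mu0 - 0.5 * delta0 * d)
    z2 = np.max(inc - d * s[m] / m - 0.5 * delta0 * d)
    z3 = np.max(inc - d * s[m] / m - 0.5 * delta0 * d * (1.0 - d / m))
    # Second form of (3.19): current value minus running minimum.
    # It allows i = j, so it equals max(z1, 0).
    w = s - np.arange(m + 1) * (mu0 + 0.5 * delta0)
    z1_alt = np.max(w - np.minimum.accumulate(w))
    return float(z1), float(z2), float(z3), float(z1_alt)


if __name__ == "__main__":
    data = [0.1, -0.3, 1.2, 0.9, 1.1, -0.2, 0.0, -0.4]   # illustrative data
    print(epidemic_statistics(data, delta0=1.0))
```

The enumeration costs $O(m^2)$ time and memory, which is fine for a sketch; the running-minimum form shows that $Z_1$ itself needs only a single pass.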

Levin and Kline (1984) discuss Bernoulli and Poisson data; and in that
context an important aspect of their test is their proposal to use a conditional
distribution given $S_m$, which under $H_0$ does not depend on the unknown $\mu_0$, to
compute a significance level. In the Gaussian case under discussion here the
conditional and unconditional distributions are the same. Since (3.20) is somewhat
easier to study than (3.21), an interesting question is to what extent the
two statistics behave similarly. Presumably they do if the duration of the
epidemic, $\rho_2 - \rho_1$, is small compared to $m$, but not in general.

For a completely different problem which leads to consideration of (3.20) in the
special case $\delta_0 = 0$ and the simpler framework of continuous time, see Adler and
Brown (1986). For a different underlying random walk the probability that $Z_1$ in
(3.19) is greater than $b$ can be interpreted as the probability that among the first
$m$ customers of a GI/G/1 queue, at least one has a waiting time exceeding $b$.
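The queueing interpretation rests on the standard Lindley recursion $W_n = \max(0, W_{n-1} + y_n)$, whose solution is $W_n = S_n - \min_{0 \le k \le n} S_k$, the same functional that appears in the second form of (3.19). The sketch below checks the identity on simulated increments; the function names, the Gaussian increments and the indexing of customers are illustrative assumptions.

```python
import numpy as np


def max_waiting_time(y) -> float:
    """Largest value of the Lindley recursion W_n = max(0, W_{n-1} + y_n), W_0 = 0."""
    w = 0.0
    w_max = 0.0
    for step in y:
        w = max(0.0, w + float(step))
        w_max = max(w_max, w)
    return w_max


def max_drawup(y) -> float:
    """max over j of (S_j - min_{i <= j} S_i) for the random walk with increments y."""
    s = np.concatenate(([0.0], np.cumsum(y)))
    return float(np.max(s - np.minimum.accumulate(s)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Negative drift, as for (3.19) under the null hypothesis.
    y = rng.normal(loc=-0.5, scale=1.0, size=200)
    print(max_waiting_time(y), max_drawup(y))
```

The two functions return the same number up to rounding, so the event that some waiting time exceeds $b$ coincides with the event that the maximal draw-up of the walk exceeds $b$.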

The appearance of a two-dimensional indexing set in (3.19)-(3.21), corresponding
to the unknown onset and disappearance of the epidemic, makes the null
hypothesis sampling distributions of these statistics quite different from those
discussed earlier. Approximations to the power function seem more complicated
in detail, but for the most interesting range of parameter values do not seem to
require fundamentally new ideas. The remainder of this section describes some
promising methods for approximating the null hypothesis distributions of
(3.19)-(3.21).

We begin with the relatively simple (3.19), which gives us an idea of what we 
can hope to achieve in the more complicated (3.20) and (3.21). 

Let $y_1, y_2, \ldots$ be independent, identically distributed random variables with
$E(y_1) < 0$. Let $S_n = y_1 + \cdots + y_n$ and for $b > 0$ define

$$\tau = \tau(b) = \inf\{n: S_n - \min_{0 \le k \le n} S_k \ge b\}. \tag{3.22}$$
The following inequality is useful in analyzing (3.19). 
PROPOSITION 3.23. Let $\tau = \tau(b)$ be defined by (3.22), $\tau_+ = \inf\{n: S_n > 0\}$,
and $T = \inf\{n: S_n \notin (0, b)\}$. Then

$$P\{\tau(b) \le m\} \le P\{\tau_+ = \infty\}\, E\{(m - T + 1);\ T \le m,\ S_T \ge b\} + \sum_{n=0}^{m-1} P\{n < \tau_+ < \infty\}\, P\{T \le m - n,\ S_T \ge b\}. \tag{3.24}$$

Moreover, a lower bound for $P\{\tau(b) \le m\}$ is the right-hand side of (3.24)
divided by $1 + E\{(m - T + 1);\ T \le m,\ S_T \ge b\}$.


REMARK 3.25. With large deviation scaling and observations whose distribution
can be imbedded in an exponential family, one can use likelihood ratio
identities similar to, but simpler than, those developed in Part 2 to obtain first-
or second-order asymptotic approximations to $P\{\tau(b) \le m\}$. For example, for
the normal case in (3.19), if $m \exp(-\delta_0 b) \to 0$ and $\delta_0 m/2b - 1$ is bounded below
by some positive number, then

$$P\{Z_1 \ge b\} \sim \delta_0(m\delta_0/2 - b)\,\nu^2(\delta_0)\exp(-\delta_0 b), \tag{3.26}$$

where $\nu$ is given by (2.12). If in fact $m^2\exp(-\delta_0 b) \to 0$, then the error in (3.26) is
$K(\delta_0)\exp(-\delta_0 b)(1 + o(1))$, where $K$ is very complicated to evaluate exactly, but
satisfies $K(\delta) \to 1$ as $\delta \to 0$ (and better approximations are possible). Details of
this asymptotic analysis will be presented elsewhere.
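A numerical evaluation of (3.26) needs the function $\nu$. Since (2.12) lies outside this section, the sketch below assumes the series form $\nu(x) = 2x^{-2}\exp\{-2\sum_{n \ge 1} n^{-1}\Phi(-x n^{1/2}/2)\}$ used elsewhere in Siegmund's work, truncated after a fixed number of terms; the function names and the truncation point are hypothetical choices.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def nu(x: float, terms: int = 10000) -> float:
    """Assumed series form of nu: 2 x^-2 exp(-2 sum_n Phi(-x sqrt(n) / 2) / n).

    The truncation is adequate for x of order 0.1 or larger; smaller x
    needs more terms (or the approximation exp(-0.583 x)).
    """
    if x <= 0.0:
        raise ValueError("x must be positive")
    total = sum(std_normal_cdf(-0.5 * x * math.sqrt(n)) / n for n in range(1, terms + 1))
    return 2.0 / x**2 * math.exp(-2.0 * total)


def approx_3_26(m: int, b: float, delta0: float) -> float:
    """Right-hand side of (3.26): delta0 (m delta0 / 2 - b) nu(delta0)^2 exp(-delta0 b)."""
    if m * delta0 / 2.0 <= b:
        raise ValueError("(3.26) requires m * delta0 / 2 > b")
    return delta0 * (m * delta0 / 2.0 - b) * nu(delta0) ** 2 * math.exp(-delta0 * b)


if __name__ == "__main__":
    print(nu(1.0), math.exp(-0.583))      # the two should be close
    print(approx_3_26(m=100, b=8.0, delta0=1.0))
```

Because $\nu(\delta_0)$ enters squared, replacing it by 1 (the uncorrected Brownian approximation) would inflate the result by a factor of roughly three when $\delta_0 = 1$.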


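PROOF OF PROPOSITION 3.23. Let $\omega_n^+$ denote the $n$-shifted sample path, so
$S_k(\omega_n^+) = S_{n+k}(\omega) - S_n(\omega)$. The event $\{\tau \le m\}$ can be decomposed into a union
of disjoint events,

$$\{\tau \le m\} = \bigcup_{n=0}^{m-1} \Bigl\{\tau > n,\ S_n = \min_{0 \le k \le n} S_k,\ T(\omega_n^+) \le m - n,\ S_T(\omega_n^+) \ge b\Bigr\}.$$

Hence by independence,

$$P\{\tau \le m\} = \sum_{n=0}^{m-1} P\Bigl\{\tau > n,\ S_n = \min_{0 \le k \le n} S_k\Bigr\}\, P\{T \le m - n,\ S_T \ge b\}$$

$$\ge \sum_{n=0}^{m-1} \bigl[P\{S_n - S_k \le 0,\ \forall k \le n\} - P\{\tau \le n\}\bigr]\, P\{T \le m - n,\ S_T \ge b\}$$

$$\ge \sum_{n=0}^{m-1} \bigl[P\{\tau_+ > n\} - P\{\tau \le m\}\bigr]\, P\{T \le m - n,\ S_T \ge b\}$$

$$= \bigl[P\{\tau_+ = \infty\} - P\{\tau \le m\}\bigr]\, E\{(m - T + 1);\ T \le m,\ S_T \ge b\} + \sum_{n=0}^{m-1} P\{n < \tau_+ < \infty\}\, P\{T \le m - n,\ S_T \ge b\}.$$

Rearranging gives the lower bound, and a similar, simpler argument gives the
upper bound.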


In principle the method sketched in Proposition 3.23 and Remark 3.25 to 
approximate the distribution of (3.19) should also be applicable to (3.20). In this 
case because of nonstationarity one must decompose $\{Z_2 \ge b\}$ not only according
to the location of a (relative) minimum of the sample path but also according to 
the value of the process at the minimum. The details become much more 
complicated and are not pursued here. For the simpler case of Brownian motion 
it is straightforward to obtain what one expects to be very good upper bounds. 
Analyzing these asymptotically leads to the following conjecture. 


CONJECTURE. Suppose $m \to \infty$, $b \to \infty$ such that for some fixed $0 < \zeta < \infty$
and $-\infty < \xi_0 < \zeta$,

$$b/m = \zeta \quad\text{and}\quad \xi/m = \xi_0.$$

Then

$$P_\xi^{(m)}\Bigl\{\max_{0 \le s < t \le m} [W(t) - W(s)] \ge b\Bigr\} = [2m^{-1}(2b - \xi)(b - \xi) + 1 + o(1)]\exp[-2m^{-1}b(b - \xi)]. \tag{3.27}$$

It is easy to derive the leading term in (3.27) (cf. Theorem 3.28 below). Since a
standard reflection argument yields an exact evaluation of the two-sided probability,

$$P_\xi^{(m)}\Bigl\{\max_{0 \le s < t \le m} |W(t) - W(s)| \ge b\Bigr\},$$

it is surprising that the one-sided problem should appear to be considerably more
difficult. Hogan and Siegmund (1986) give a (very complicated) proof of (3.27)
and discuss the relation of (3.27) to the two-sided probability.

An alternative method for approximating the null hypothesis distributions of 
(3.19) and (3.20), which works equally well for (3.21), is that developed indepen- 
dently by Bickel and Rosenblatt (1973) and Qualls and Watanabe (1973). (Both 
of these papers generalize to multidimensional time parameters the method of 
Pickands (1969) for a linear time parameter.) Since the general results of these 
authors give a tail probability for the maximum of a Gaussian field in terms of 
an integral involving another complicated probability, it is not immediately 
evident that the computational problem has been essentially simplified. But for 
Gaussian fields built up from random walks (or Brownian motion) in sufficiently
simple ways one can use renewal theory to evaluate the required integrals in
terms of the function $\nu$ of (2.12). For illustration the tail behavior under $H_0$ of
(3.20) and (3.21) is given below.

Let $x_1, x_2, \ldots$ be independent standard normal random variables, and put
$S_n = x_1 + \cdots + x_n$. Suppose $b$ and $m \to \infty$ in such a way that $m^{-1}b = \zeta$ is a
fixed positive constant.


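THEOREM 3.28. For $\xi_0 > -\zeta$,

$$P\Bigl\{\max_{0 \le i < j \le m} \bigl[S_j - S_i - m^{-1}(j - i)S_m - (j - i)\xi_0\bigr] > b\Bigr\} \sim \nu^2[2(2\zeta + \xi_0)]\,[2m(2\zeta + \xi_0)(\zeta + \xi_0)]\exp[-2m\zeta(\zeta + \xi_0)],$$

where $\nu$ is given by (2.12).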
THEOREM 3.29. Let $\xi_0 > 0$. Then for $\xi_0 > 4\zeta$,

$$P\Bigl\{\max_{0 \le i < j \le m} \bigl[S_j - S_i - m^{-1}(j - i)S_m - \xi_0(j - i)(1 - m^{-1}(j - i))\bigr] \ge b\Bigr\}$$


~ v3(2Eq)mé3[}— 78] 'exp(—2mé,f), 


while for $0 < \xi_0 < 4\zeta$ it is


~ v2(£, + Atam E + EAAS E — &/4) |” *exp| -2m( + ias], 
where v is given by (2.12). 


Note that in (3.26), Theorem 3.28, and Theorem 3.29 the function $\nu(\cdot)$, which
accounts for excess over the boundary, is squared, basically because of the
two-dimensional time parameter. Since typical values for $\nu(\cdot)$ are in the
range 0.5 to 0.7, for these problems use of a simple Brownian motion approximation,
which replaces $\nu(\cdot)$ by 1, probably gives extremely poor results.

The preceding discussion is only a beginning attempt to study the problem of 
change-points with epidemic alternative. It is included here to show how quickly 
natural generalizations of previous work lead into new territory, requiring new 
ideas. Two obvious questions are (i) how good are these approximations and (ii) 
what do they (presumably in conjunction with approximations for the power 
function) tell us about the relative merits of (3.20) and (3.21)? Preliminary Monte 
Carlo experiments indicate that the approximations are not nearly so accurate 
for small m as those given in Part 2, although the derivations of (3.26) and (3.27) 
might lead one to expect quite good approximations. 


APPENDIX 1 


Proof of Theorem 3.11. Both Theorems 2.18 and 3.11 can be proved by the 
method of Siegmund (1982), which requires rather lengthy analytic calculations. 
The somewhat different method used in this paper to prove Theorem 2.18 yields 
a considerably simpler proof of that result, so one naturally asks how well it 
adapts to related problems. As we shall see below, it gives the appearance of a 
relatively computation free proof of Theorem 3.11. However, there are some 
technical problems which seem to demand additional analytic computation for 
their complete solution. 

Let T be defined by (3.9) and assume the conditions of Theorem 3.11. We also 
use the notation P{7} from the proof of Theorem 2.18. Let 


Pym = f° PEPELO ~ 4) /me]'7[d = 6/0 = 41) 


x[(1 = 4) /me,]'7 dà. 
Also let T* = sup{n: n < m,,|S,| 2 b[n(1 — n/m)]'/*}, so PYEUT < m} = 
Py" (T * > mo}. As in the proof of Theorem 2.18 one easily calculates the 


Aap 


likelihood ratio of x„,..., Xm, under P{”" relative to PẸ? and obtains 
[n(1 ~ 4,)/(m -= njt]'Mexp[$S7/n(1 — n/m) — $4/m,(1 — m,/m)], 
where t, = m,/m (i = 0,1). Hence Wald’s likelihood ratio identity yields 
Pye {T* = my} [a -t )/t,|'exp[4( 6? i m/t (1 7 t,))| 
= EL" [(m — T*)/T*]'exp(—R*); T* = mo}, 
where R* = 1[S}./T*(1 — T*/m) — b?]. 


(A.1) 


(A.2) 


The equation (A.2) is analogous to (2.23) in the proof of Theorem 2.18, and we
try to evaluate it similarly. Let

$$\tau = \inf\{n: (\xi + S_n)^2 / (m_1 - n)[1 - m^{-1}(m_1 - n)] \ge b^2\}$$

and

$$R_m = \tfrac12\{(\xi + S_\tau)^2 / (m_1 - \tau)[1 - m^{-1}(m_1 - \tau)] - b^2\}.$$

Past experience with large deviation scaling leads one to expect that $\tau/m \to t$ in
probability for some constant $t$. We consider the definition of $\tau$ expressed in the
general form

$$\tau = \inf\{n: mh(n/m, S_n/m) \ge 0\} \tag{A.3}$$

and expand $h$ in a Taylor series about $(t, \mu t)$, where $\mu$ denotes the drift per unit
time of $S_n$ (from (A.1) $\mu = \xi_0/(1 - t_1)$). This shows that for $n$ close to the
random time $\tau$,

$$mh(n/m, S_n/m) = m(h - th_1 - \mu t h_2) + nh_1 + S_n h_2 + \cdots,$$

where $h_i$ denotes differentiation with respect to the $i$th argument, and $h$, $h_1$,
and $h_2$ are evaluated at $(t, \mu t)$. The heuristic reasoning following (2.27) suggests
that the higher-order terms play no role in determining the asymptotic distribution
of $R_m$, which is thus obtained by applying (2.8) to $nh_1 + S_n h_2$. Although
this conjecture is correct and allows one to obtain easily the result claimed in the
statement of Theorem 3.11, there are two technicalities making a rigorous proof
more difficult. (i) Unlike the situation in Theorem 2.18, the $P_\xi^{(m)}$ process
$S_{m_1} - S_n$, $n = m_1, m_1 - 1, \ldots$ is not a random walk, so the renewal theorem is
not directly applicable. (ii) Even if it were a random walk, the technical
conditions of Lai and Siegmund (1977) are not fulfilled, at least not at the level
of generality (A.3).

Fortunately (ii) is solved by Hogan (1984), who considers nonlinear renewal 
theory for processes of the form (A.3). See Appendix 2. To circumvent (1), note 
that by (A.1) and (A.2) it suffices to evaluate 


Ex" [m — T*)/T*]'exp(—R3,); T* = mo} 
7 ES .{ [(m —m, + t)/(m, = T)| exp(-R m); TSM, — my} 
and then integrate out A. For A of the form 
A= mé,/(1 — t) + nmi”, 
one can easily calculate the likelihood ratio of x,,..., x, under Pj, relative to 


the unconditional probability P, (a = €)/(1 — ¢,)), which has essentially the 
same drift per unit time, to obtain 


Ei") {[(m —m,+17)/(m, - 7)]'“exp(—R,,); Tsm- mo) 
= E,{ {[m,(m —~m,+7)|'?/(m, — r)} 
xexp|—R_ ~ S m=) 
+2n(S.— pr)/m'” — in?r/(m, — r)|; TSM- mo}. 


It is now straightforward, but tedious, to use the asymptotic degeneracy of 
t/m, the asymptotic normality of [S, — ut]/r'/”, its asymptotic independence 




from F,, (as in Lemma 2.16), and the P,-limiting distribution of Rm given by 
Hogan’s (1984) nonlinear renewal theorem to evaluate this expectation and hence 
complete the proof of Theorem 3.11. 

Hu (1985) uses Hogan’s methods and a likelihood ratio argument to give a 
general nonlinear renewal theorem for conditional random walks in (smooth) 
exponential families. A beautiful related result is obtained by Neyman (1982). 


APPENDIX 2 


Nonlinear renewal theory. In Section 2.3 the renewal theorem was used to
approximate the distribution of the excess of a random walk at its first passage
across a linear boundary. In Theorems 2.18 and 3.11 similar problems arose with
regard to first passages to nonlinear boundaries. In this appendix we survey the
appropriate nonlinear renewal theory.

The problem is complicated by the fact that the stopping time and the excess
over the boundary usually can be defined in more than one way. Conceptually
the simplest situation involves a stopping time of the form

$$T = T_m = \inf\{n: S_n \ge mc(n/m)\}. \tag{A.4}$$

Here $c(\cdot)$ is a positive continuous function, $S_n$, $n = 1, 2, \ldots$ is a random walk
with positive drift $\mu = E(S_1)$, and we assume that the ray $\mu t$ crosses the curve
$c(t)$ at exactly one point, $\bar t$, near which $c(\cdot)$ is twice continuously differentiable.
The stopping rules $\tau^*$ and $\tau$ introduced in the proofs of Theorems 2.18 and 3.11,
respectively, are both essentially of this form.

It follows from an argument based on the strong law of large numbers that
$T_m/m \to \bar t$ with probability 1 as $m \to \infty$. Since as $m \to \infty$ the curve $mc(n/m)$
for $n$ close to $m\bar t$ flattens out, it is natural to conjecture that the excess over the
curved boundary, $R_m = S_T - mc(T/m)$, converges in law to the same limit as
the excess over the tangent to $mc(\cdot)$ at the point $\bar t$, which is given by (2.8) with
$\tilde S_n = S_n - nc'(\bar t)$.
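The conjecture can be explored by simulation: run one random walk against both the curved boundary $mc(n/m)$ and its tangent at the crossing point, and compare crossing times and excesses. The sketch below uses an illustrative, assumed setup that is not taken from the paper: normal increments with drift $\mu = 2$ and $c(t) = 1 + t^{1/2}$, so the ray $\mu t$ meets $c$ only at $t = 1$, where $c'(1) = 1/2$. All names are hypothetical.

```python
import numpy as np


def first_passage(s, boundary):
    """Return (n, excess) for the first n >= 1 with s[n-1] >= boundary(n)."""
    n = np.arange(1, s.size + 1)
    hits = np.nonzero(s >= boundary(n))[0]
    if hits.size == 0:
        raise ValueError("no crossing within the simulated horizon")
    k = hits[0]
    return int(n[k]), float(s[k] - boundary(n[k]))


def simulate(m: int, seed: int = 0):
    """Cross the curve m * c(n / m) and its tangent with the same sample path."""
    rng = np.random.default_rng(seed)
    mu = 2.0                                         # assumed drift of the walk
    s = np.cumsum(rng.normal(mu, 1.0, size=3 * m))   # S_1, ..., S_{3m}

    def curve(n):
        return m * (1.0 + np.sqrt(n / m))            # m c(n/m), c(t) = 1 + sqrt(t)

    def tangent(n):
        return 2.0 * m + 0.5 * (n - m)               # m c(1) + c'(1) (n - m)

    return first_passage(s, curve), first_passage(s, tangent)


if __name__ == "__main__":
    (t_curve, r_curve), (t_tan, r_tan) = simulate(10000)
    print(t_curve, r_curve, t_tan, r_tan)
```

Because this $c$ is concave, the tangent lies above the curve, so the walk crosses the curve no later than the tangent. Repeating the experiment over many seeds and comparing the empirical distributions of the two excesses is one way to see the conjectured agreement as $m$ grows.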

Although the conjecture of the preceding sentence is true under quite general
conditions, in special cases it follows from a somewhat different result, which is
considerably easier to prove. When possible, it is convenient to rewrite (A.4) in
the form

$$T = T_a = \inf\{n: ng(n^{-1}S_n) \ge a\} \tag{A.5}$$

for suitable $g$ and $a$ (depending on $m$). For example, for $c(\cdot)$ concave, so
$c_1(\cdot) = c(\cdot)/(\cdot)$ decreases, we find that $g = 1/c_1^{-1}$. For the stopping time (A.5),
the excess over the boundary is $R_a = Tg(S_T/T) - a$. A Taylor series expansion
of $g$ yields

$$ng(S_n/n) = ng(\mu) + (S_n - n\mu)g'(\mu) + (S_n - n\mu)^2 g''(\zeta_n)/2n,$$

where $\zeta_n$ satisfies $|\zeta_n - \mu| \le |n^{-1}S_n - \mu|$. If $g(\mu) > 0$ (as we shall assume), the
linear part of $ng(S_n/n)$ is a random walk increasing at a rate proportional to $n$,
whereas the quadratic part is essentially constant. This leads one to suspect that
the limiting distribution of $R_a$ is given by (2.8) with $\tilde S_n = ng(\mu) + (S_n - n\mu)g'(\mu)$.
This conjecture is also true and has been given an abstract
formulation by Lai and Siegmund (1977). They consider stopping rules of the
form

$$T = \inf\{n: \tilde S_n + \eta_n \ge a\}, \tag{A.6}$$

where $\tilde S_n$, $n = 1, 2, \ldots$ is a nonarithmetic random walk with positive mean
$\tilde\mu = E\tilde S_1$ and $\eta_n$ changes sufficiently slowly, in a sense made precise below, that it
plays no role in determining the limiting distribution of $\tilde S_T + \eta_T - a$ as $a \to \infty$.
A typical application is to prove that the limiting distribution of $R_a$ is as
indicated above. Lai and Siegmund also apply their result to approximate the
significance level, (1.2) with $\mu = 0$, of a repeated significance test. Lalley (1983)
extends the Lai-Siegmund method to the much more difficult case of multiparameter
exponential families.

In the proofs of Theorems 2.18 and 3.11 one wants to apply a renewal theorem
to stopping times essentially of the form (A.4). In both cases the boundary curve
$c(\cdot)$ is concave, so the stopping time can also be expressed in the form (A.5), to
which the Lai-Siegmund result applies. However, since this reduction seems
restricted to the case of one-dimensional random walks, we describe below a
different approach due to Hogan (1984) which seems particularly well suited to
multidimensional problems.

Actually 7* defined in (2.28) to prove Theorem 2.18 is almost of the abstract 
form (A.6) with S, = by?éj'n + Sp, 4, = 4m7'é7'S?, and a = 4£5'm(u? — €2), 
except that Lai and Siegmund do not permit 4, to depend on a. A suitable 
essentially trivial extension, modeled on Woodroofe’s (1982) reformulation of the 
Lai—Siegmund result, is given next. This performs the dual function of complet- 
ing the proof of Theorem 2.18 and of explaining the general nature of this class 
of results. Afterwards we discuss briefly the method used by Hogan (1984) to 
deal directly with stopping times of the form (A.3) or (A.4). 


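THEOREM A.7. Let $T$ be defined by (A.6), where $\tilde S_n$, $n = 1, 2, \ldots$, is a
nonarithmetic random walk with positive mean $\tilde\mu = E(\tilde S_1)$ and for all $n$, $\eta_n = \eta_n(a)$
is a measurable function of $\tilde S_1, \ldots, \tilde S_n$. Suppose also that for each $\lambda > 0$

$$a^{-1}\max_{1 \le n \le a\lambda} |\eta_n| \to_P 0, \qquad a \to \infty,$$

and for each $\lambda, \varepsilon > 0$ there exists $\delta = \delta(\lambda, \varepsilon)$ such that

$$\max_{n \le a\lambda} P\Bigl\{\max_{1 \le k \le a\delta} |\eta_{n+k} - \eta_n| > \varepsilon\Bigr\} < \varepsilon. \tag{A.8}$$

Then as $a \to \infty$, for all $x > 0$,

$$P\{T < \infty,\ \tilde S_T + \eta_T - a \le x\} \to H(x),$$

where $H$ is defined to be the right-hand side of (2.8).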
With the help of Theorem A.7 one can easily complete the proof of Theorem
2.18. The critical condition in the statement of Theorem A.7 is (A.8). The
method of proof involves conditioning on $\tilde S_{n_1} + \eta_{n_1}$, where $n_1$ is chosen so that
$\tilde S_{n_1} + \eta_{n_1}$ is already close enough to $a$ that by (A.8) $\eta_n$ is constant (to within $\varepsilon$)
for all $n_1 \le n \le T$, but it is far enough from $a$ that the renewal theorem applies
to the random walk $\tilde S_n - \tilde S_{n_1}$, $n = n_1 + 1, \ldots$. Hence except for an event
of arbitrarily small probability $\tilde S_n + \eta_n$ and $\tilde S_{n_1} + \eta_{n_1} + (\tilde S_n - \tilde S_{n_1})$ cross $a$ at
the same time and have the same excess, to within $\varepsilon$. See Figure 8. The renewal
theorem gives the indicated limiting distribution of excess over the boundary for
the second process and hence for the process of interest.

FIG. 8

It is easy to see where this argument runs into difficulty in dealing with a
stopping rule of the form (A.4) or what is essentially the same, (A.3). If we
assume that $ES_1^2 < \infty$, the variability in $S_n$ is $O(n^{1/2})$. Hence to have probability
close to one that $S_{n_1} < mc(n_1/m)$, one must choose $n_1 = m\bar t - Km^{1/2}$ for
some large value of $K$. But a Taylor expansion shows that $mc(n/m)$ and the
tangent $mc(\bar t) + c'(\bar t)(n - m\bar t)$ are essentially the same only for $|n - m\bar t| \le \delta m^{1/2}$
for small $\delta$. To circumvent this difficulty one can introduce the auxiliary
stopping time

$$T_1 = \inf\{n: S_n \ge mc(n/m) - \delta m^{1/2}\}.$$

From the fact that $m^{-1}T_1 \to \bar t$ and the assumption $ES_1^2 < \infty$, it follows that
with probability approaching 1,

$$S_{T_1} - [mc(T_1/m) - \delta m^{1/2}] \le \max_{1 \le i \le T_1}(S_i - S_{i-1})^+ + \sup|c'(t)|$$

is $o(m^{1/2})$ and hence $mc(T_1/m) - S_{T_1}$ is large. Moreover, during the approximately
$\delta m^{1/2}/[\mu - c'(\bar t)]$ additional steps the random walk requires to cross the
curve, the distance between the curve and its tangent is small, provided $\delta$ is
small. Hence the Lai-Siegmund argument with the random time $T_1$ in place of
$n_1$ shows that the time at which the random walk crosses the curve and the
excess over the curve are with high probability equal to the time it crosses the
tangent and almost equal to the excess over the tangent. Thus the nonlinear
problem is reduced to a linear one having an answer given by (2.8) with
$\tilde S_n = S_n - nc'(\bar t)$.

This argument is easily made precise and also extended along the lines of 
Lemma 2.16. The result provides an appropriate tool for completing the proof of 
Theorem 3.11, or of Theorem 2.18 for that matter. 

Hogan (1984) develops a much more sophisticated version of the argument for
stopping times of the form (A.3). He does not require that $ES_1^2 < \infty$, and in his
definition of $T_1$ he replaces $\delta m^{1/2}$ by a large constant $K$. This minimizes the
smoothness conditions imposed on $h$ (or $c(\cdot)$). More importantly, however,
Hogan's method also proves a nonlinear renewal theorem in problems scaled for
a diffusion approximation, where the methods described here and also
Woodroofe's (1976a) method fail completely.

It would be interesting to give an abstract formulation of Hogan’s result for a 
stopping time as in (A.6), since the method seems much more general than the 
cases actually covered by Hogan’s theorems. 


Acknowledgment. I would like to thank the referees for their careful 
reading and thoughtful comments. 


REFERENCES 


ADLER, R. J. and BROWN, L. (1986). Tail behaviour for suprema of empirical processes. Ann.
Probab. 14 1-30.

ANDERSON, T. W. (1960). A modification of the sequential probability ratio test to reduce the sample
size. Ann. Math. Statist. 31 165-197.

ANDERSON, T. W. and DARLING, D. A. (1952). Asymptotic theory of certain “goodness of fit” criteria
based on stochastic processes. Ann. Math. Statist. 23 193-212.

ARMITAGE, P. (1975). Sequential Medical Trials, 2nd ed. Blackwell, Oxford.

BERK, R. and BROWN, L. (1978). Sequential Bahadur efficiency. Ann. Statist. 6 567-581.

BHAT (β-Blocker Heart Attack Trial Research Group) (1981). β-blocker heart attack trial: design
features. Controlled Clin. Trials 2 275-285.

BHAT (β-Blocker Heart Attack Trial Research Group) (1982). A randomized trial of propranolol in
patients with acute myocardial infarction. J. Amer. Med. Assoc. 247 1707-1714.

BHATTACHARYA, P. K. and BROCKWELL, P. J. (1976). The minimum of an additive process with
applications to signal estimation and storage theory. Z. Wahrsch. verw. Gebiete 37
51-75.

BICKEL, P. and ROSENBLATT, M. (1973). Two-dimensional random fields. In Multivariate Analysis
3 (P. R. Krishnaiah, ed.) 3-15. Academic, New York.

BIRNBAUM, Z. W. and TINGEY, F. H. (1951). One-sided confidence contours for probability distribution
functions. Ann. Math. Statist. 22 592-596.

BOROVKOV, A. A. (1962). New limit theorems in boundary problems for sums of independent terms.
Selected Transl. Math. Statist. Probab. 5 315-372.

BROWN, R. L., DURBIN, J. and EVANS, J. M. (1975). Techniques for testing the constancy of
regression relations over time. J. Roy. Statist. Soc. Ser. B 37 149-192.

CHERNOFF, H. and ZACKS, S. (1964). Estimating the current mean of a normal distribution which is
subject to changes in time. Ann. Math. Statist. 35 999-1028.

COBB, G. W. (1978). The problem of the Nile: Conditional solution to a change point problem.
Biometrika 65 243-261.

COX, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B
34 187-220.

COX, D. R. (1975). Partial likelihood. Biometrika 62 269-276.


DANIELS, H. (1974). The maximum size of a closed epidemic. Adv. in Appl. Probab. 6 607-621.

DELONG, D. M. (1981). Crossing probabilities for a square root boundary by a Bessel process. Comm.
Statist. A-Theory Methods A10 2197-2213.

DEMETS, D., HARDY, R., FRIEDMAN, L. and LAN, K. K. G. (1984). Statistical aspects of early
termination in the beta-blocker heart attack trial. Controlled Clinical Trials 4.

DURBIN, J. (1985). The first passage density of a Gaussian process to a curved boundary. J. Appl.
Probab. 22 99-122.

FAIRBANKS, K. and MADSEN, R. (1982). P values for tests using a repeated significance test design.
Biometrika 69 69-74.

FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2. Wiley, New York.

GARDNER, L. A., JR. (1969). On detecting changes in the mean of normal variates. Ann. Math.
Statist. 40 116-126.

GILL, R. D. (1980). Censoring and Stochastic Integrals. Mathematical Centre Tracts 124.
Mathematisch Centrum, Amsterdam.

HALPERN, J. (1982). Points of interest on the grand tour. Technical report, Stanford University.

HAYBITTLE, J. L. (1971). Repeated assessments of results in clinical trials of cancer treatment.
British J. Radiology 44 793-797.

HINKLEY, D. V. (1970). Inference about the change point in a sequence of random variables.
Biometrika 57 1-17.

HINKLEY, D. V. (1972). Time ordered classification. Biometrika 59 509-523.

HOGAN, M. (1984). Problems in boundary crossings for random walks. Dissertation, Stanford
University.

HOGAN, M. and SIEGMUND, D. (1986). Large deviations for the maxima of some Gaussian random
fields. Adv. in Appl. Math. 8.

HU, I. (1985). On repeated significance tests. Dissertation, Stanford University.

JENNEN, C. and LERCHE, R. (1981). First exit densities of Brownian motion through one-sided
moving boundaries. Z. Wahrsch. verw. Gebiete 55 133-148.

KENDALL, D. G. and KENDALL, W. S. (1980). Alignments in two-dimensional random sets of points.
Adv. in Appl. Probab. 12 380-424.

LAI, T. L. and SIEGMUND, D. (1977). A nonlinear renewal theory with applications to sequential
analysis I. Ann. Statist. 5 946-954.

LALLEY, S. (1983). Repeated likelihood ratio tests for curved exponential families. Z. Wahrsch.
verw. Gebiete 62 293-321.

LEVIN, B. and KLINE, J. (1984). The cusum test of homogeneity, with an application to spontaneous
abortion epidemiology. Preprint, Columbia University.

LORDEN, G. (1971). Procedures for reacting to a change in distribution. Ann. Math. Statist. 42
1897-1908.

MILLER, R. G. (1970). Sequential rank tests: one sample case. Proc. Sixth Berkeley Symp. Math.
Statist. Prob. 1 97-108. Univ. California Press.

MILLER, R. G., JR. and SIEGMUND, D. (1982). Maximally selected chi square statistic. Biometrics 38
1011-1016.

NEYMAN, A. (1982). Renewal theory for sampling without replacement. Ann. Probab. 10 464-481.

O'BRIEN, P. C. and FLEMING, T. R. (1979). A multiple testing procedure for clinical trials.
Biometrics 35 549-556.

PAGE, E. S. (1954). Continuous inspection schemes. Biometrika 41 100-115.

PETO, R., PIKE, M. C., ARMITAGE, P., BRESLOW, N. E., COX, D. R., HOWARD, S. V., MANTEL, N.,
MCPHERSON, K., PETO, J. and SMITH, P. G. (1976). Design and analysis of randomized
clinical trials requiring prolonged observation of each patient. British J. Cancer 34
585-612.

PETTITT, A. N. (1980). A simple cumulative sum type statistic for the change-point problem with
zero-one observations. Biometrika 67 79-84.

PICKANDS, J. (1969). Upcrossing probabilities for stationary Gaussian processes. Trans. Amer.
Math. Soc. 145 51-73.

POCOCK, S. (1977). Group sequential methods in the design and analysis of clinical trials.
Biometrika 64 191-200.

404 D. SIEGMUND) 


POoLLaK, M. (1985). Optimal detection of a change in distribution. Ann. Statist 13 206-227. 

QUALLS, C. and WATANABE, H. (1973). Asymptotic properties of Gaussian random fields. Trans. 
Amer. Math. Soc. 177 165-171. 

SAMUEL-CAHN, E. (1974). Repeated significance tests II, for hypotheses about the normal distmbu- 
tion. Comm. Statist. 3 711-733. 

SAMUEL-CAHN, E. and Wax, Y. (1986). A sequential “gambler’s-ruin” test for p, = pz, when p, are 
small, with a medical application, and incorporation of data accumulated after stopping. 
To appear in Biometrics. 

SELLKE, T. (1985). Evolution of the partial likelihood over time. Technical Report, Purdue Univer- 
Sity, 

SELLKE, T and SIEGMUND, D. (1983). Sequential analysis of the proportional hazards model. 
Biometrika 10 315-326. 

SEN, A. and SRIVASTAVA, M. S. (1975). On tests for detecting changes in means. Ann. Statist. 3 
98-108. 

SHIRYAYEV, A. N. (1963). On optimal methods in earliest detection problems. Theory Probab. Appl.
8 26-51.

SIEGMUND, D. (1976). Importance sampling in the Monte Carlo study of sequential tests. Ann.
Statist. 4 673-684.

SIEGMUND, D. (1977). Repeated significance tests for a normal mean. Biometrika 64 177-189.

SIEGMUND, D. (1978). Estimation following sequential tests. Biometrika 65 341-349.

SIEGMUND, D. (1982). Large deviations for boundary crossing probabilities. Ann. Probab. 10
581-588.

SIEGMUND, D. (1985a). Corrected diffusion approximations and their applications. In Proc. Berkeley
Conference in Honor of Jerzy Neyman and Jack Kiefer 2 (L. Le Cam and R. Olshen,
eds.) 599-617. Wadsworth, Monterey, Calif.

SIEGMUND, D. (1985b). Sequential Analysis: Tests and Confidence Intervals. Springer, New York.

SIEGMUND, D. and YUH, Y.-S. (1982). Brownian approximations to first passage probabilities. Z.
Wahrsch. verw. Gebiete 59 239-248.

STAM, A. J. (1968). Two theorems in r-dimensional theory. Z. Wahrsch. verw. Gebiete 10 81-86.

WOODROOFE, M. (1976a). A renewal theorem for curved boundaries and moments of first passage
times. Ann. Probab. 4 67-80.

WOODROOFE, M. (1976b). Frequentist properties of Bayesian sequential tests. Biometrika 63 101-110.

WOODROOFE, M. (1978). Large deviations of the likelihood ratio statistic with applications to
sequential testing. Ann. Statist. 6 72-84.

WOODROOFE, M. (1979). Repeated likelihood ratio tests. Biometrika 66 453-463.

WOODROOFE, M. (1982). Nonlinear Renewal Theory in Sequential Analysis. Society for Industrial
and Applied Mathematics, Philadelphia.

WOODROOFE, M. and TAKAHASHI, H. (1982). Asymptotic expansions for the error probabilities of
some repeated significance tests. Ann. Statist. 10 895-908.

WORSLEY, K. J. (1983). The power of likelihood ratio and cumulative sum tests for a change in a
binomial probability. Biometrika 70 455-464.

YUH, Y.-S. (1982). Second order corrections for Brownian motion approximations to first passage
probabilities. Adv. in Appl. Probab. 14 566-581.


DEPARTMENT OF STATISTICS 
STANFORD UNIVERSITY 
STANFORD, CALIFORNIA 94305 


The Annals of Statistics 
1986, Vol. 14, No. 2, 405-417


MAXIMUM LIKELIHOOD ESTIMATORS AND LIKELIHOOD 
RATIO CRITERIA IN MULTIVARIATE COMPONENTS 
OF VARIANCE¹


By BLAIR M. ANDERSON,² T. W. ANDERSON, AND INGRAM OLKIN
Stanford University 


Maximum likelihood estimators are obtained for multivariate components
of variance models under the condition that the effect covariance
matrix is positive semidefinite with a maximum rank. The rank of the
estimator is random. The estimation procedure leads to a likelihood ratio test 
that the rank of the effect matrix is not greater than a given number against 
the alternative that the rank is not greater than a larger specified number. 
Linear structural relationship models and some factor analytic models can be 
put into this framework. 


1. Introduction. If the effects of factors or classes are random, the analysis
of variance model is called the components of variance model or model II. When
the effects and errors are normally distributed, the multivariate one-way model 
is described by the covariance matrices of the effects and of the errors and an 
overall mean vector. In the balanced case with replications a sufficient set of 
statistics consists of the between class vector sum of squares, the within class 
vector sum of squares, and the overall sample mean. Linear combinations of the 
vector sums of squares yield unbiased estimators of the two model covariance 
matrices, but the estimator of the effect covariance matrix is not necessarily 
positive semidefinite. 

In this paper we find the maximum likelihood estimators under the condition 
that the covariance matrices are positive semidefinite, in fact, under the condi- 
tion that the effect covariance matrix is positive semidefinite with a maximum 
rank. The rank of the estimator is random; it depends on the roots of a certain 
determinantal equation. The estimators depend on the corresponding vectors 
associated with a matrix equation. The estimation procedure leads to a likeli- 
hood ratio test of the null hypothesis that the rank of the effect matrix is not 
greater than a given number against the alternative hypothesis that the rank is 
not greater than a larger specified number. The usual asymptotic theory does not
hold; except in special cases, $-2$ times the logarithm of the likelihood ratio
criterion does not have a $\chi^2$-distribution.

Linear structural relationship models can be put into this framework. 
The effect vectors can be considered as the random systematic parts. Linear 


Received May 1985; revised July 1985. 

¹Supported in part by the National Science Foundation Grants DMS 82-19748 and DMS
84-11411.

²Current address is 833 N. Curson Avenue, Los Angeles, CA 90046.

AMS 1980 subject classifications. Primary 62H12; secondary 62J10.

Key words and phrases. Multivariate analysis of variance, linear structural models, factor 
analysis models. 




406 B. M. ANDERSON, T. W. ANDERSON, AND I. OLKIN 


combinations of the components of the systematic parts being constant is 
equivalent to those linear combinations of the covariance matrix of the sys- 
tematic parts being zero. We also obtain maximum likelihood estimators of the 
linear structural relationships. 

In Section 2 we consider the case of the error covariance matrix being 
proportional to the identity; that is, the components of the errors are identically 
and independently distributed. In this case we do not need replication. In 
Section 3 we treat the case of the error matrix being proportional to the identity 
with replications; although this case may not be directly applicable, it is relevant 
as a transition from Section 2 to Section 4. Finally, in Section 4 we study the 
case where the error covariance matrix is unrestricted and there are replications. 

There is a history of results dealing with components of variance models in
which particular aspects have been treated. In this paper we attempt to bring 
together specific results in a coherent manner. Although some of the results are 
new, we have given special attention to an exposition of the field. 

In an unpublished paper Anderson (1946) gave the results for the case of an 
unrestricted covariance matrix. In another unpublished paper Morris and Olkin 
(1964) independently obtained these results. Most of these were announced by 
Anderson (1984), where an extensive list of references is given. Theobald (1975) 
obtained the estimators of Section 2 (in slightly more generality) by the same 
method that we are using. More recently Schott and Saw (1984) have derived the 
maximum likelihood estimators and likelihood ratio criteria in Section 4 by a 
somewhat different method. 

Klotz and Putter (1969) obtained maximum likelihood estimators in a differ- 
ent form when no rank condition is imposed. Amemiya (1985) has also treated 
this problem. Amemiya and Fuller (1984) have found modified maximum likeli- 
hood estimators and likelihood ratio criteria when the rank is specified exactly. 
Rao (1983) proposed estimation and testing procedures that are related to 
maximum likelihood. 


2. MANOVA without replication. In the simplest case of MANOVA with 
random factors there is one observation per cell. Let the p-component observable 
random vector be 


(2.1)  $X_\alpha = \mu + V_\alpha + U_\alpha, \qquad \alpha = 1,\dots,n,$

where $\mu$ is a constant (unknown) vector, $V_1,\dots,V_n, U_1,\dots,U_n$ are independent
unobservable random vectors with means 0 and covariance matrices

(2.2)  $\mathcal{E} V_\alpha V_\alpha' = \Theta, \qquad \mathcal{E} U_\alpha U_\alpha' = \sigma^2 I.$


A vector $U_\alpha$ is interpreted as composed of errors that are uncorrelated and have
a common variance $\sigma^2$. (If $\mathcal{E} U_\alpha U_\alpha' = \sigma^2 \Psi_0$, where $\Psi_0$ is known, the model can be
transformed to replace $\Psi_0$ by $I$; see Theobald (1975).) The vector $V_\alpha$ represents
the effect of factors and characterizes the cell or class. The covariance matrix $\Theta$
of rank $m$, $0 \le m \le p$, is not necessarily positive definite. The covariance matrix
of the observed $X_\alpha$ is


(2.3)  $\mathcal{C}(X_\alpha) = \mathcal{E}(X_\alpha - \mu)(X_\alpha - \mu)' = \sigma^2 I + \Theta.$


MULTIVARIATE COMPONENTS OF VARIANCE 407 


The effect vector $V_\alpha$ is said to satisfy linear structural relationships if there is
a $q \times p$ matrix $B$ of rank $q$ such that $B V_\alpha = 0$ with probability 1. This implies
that $B\,\mathcal{E} V_\alpha V_\alpha' = 0$, that is,

(2.4)  $B\Theta = 0.$


The matrix $B$ is not uniquely determined since it can be multiplied on the left by
an arbitrary nonsingular matrix of order $q$. The rank of $B$ can be taken to satisfy
$m + q = p$. We shall obtain the maximum likelihood estimators of $\mu$, $\sigma^2$, $\Theta$, and
$B$ under the assumption that the joint distribution of the random vectors is
normal.

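Let the observations be $x_1,\dots,x_n$. The sample mean vector $\bar{x} = (1/n)\sum_{\alpha=1}^{n} x_\alpha$
and covariance matrix

(2.5)  $C = \frac{1}{n}\sum_{\alpha=1}^{n} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$

are a sufficient set of statistics. The logarithm of the likelihood function $L$ is

(2.6)  $\log L = -\frac{pn}{2}\log 2\pi - \frac{n}{2}\log|\sigma^2 I + \Theta| - \frac{n}{2}\operatorname{tr}(\sigma^2 I + \Theta)^{-1}C - \frac{n}{2}(\bar{x} - \mu)'(\sigma^2 I + \Theta)^{-1}(\bar{x} - \mu).$


For any positive semidefinite $\Theta$ and positive $\sigma^2$, $\log L$ is maximized with respect
to $\mu$ at $\hat\mu = \bar{x}$, so that the concentrated likelihood function is equivalent to


(2.7)  $\log L^* = -\log|\sigma^2 I + \Theta| - \operatorname{tr}(\sigma^2 I + \Theta)^{-1}C.$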
The canonical form of $C$ is

(2.8)  $C = W D_t W',$

where

(2.9)  $D_t = \operatorname{diag}(t_1,\dots,t_p),$

(2.10)  $W = (w_1,\dots,w_p),$

$t_1 > \cdots > t_p$ are the ordered characteristic roots of $C$ (distinct and positive
with probability 1), $w_1,\dots,w_p$ are the corresponding characteristic vectors of $C$
normalized by $w_i'w_i = 1$, and $\operatorname{diag}(t_1,\dots,t_p)$ represents a diagonal matrix with
$t_1,\dots,t_p$ as the diagonal elements. The matrix $W$ is orthogonal. The canonical
form of $\sigma^2 I + \Theta$ is

(2.11)  $\sigma^2 I + \Theta = \Gamma D_\delta \Gamma',$

where

(2.12)  $D_\delta = \operatorname{diag}(\delta_1,\dots,\delta_p),$

(2.13)  $\Gamma = (\gamma_1,\dots,\gamma_p),$

$\delta_1 \ge \cdots \ge \delta_p$ are the ordered characteristic roots of $\sigma^2 I + \Theta$, and $\gamma_1,\dots,\gamma_p$ are
corresponding characteristic vectors normalized by $\gamma_i'\gamma_j = \delta_{ij}$, the Kronecker
delta; so $\Gamma$ is orthogonal. If $\Theta$ is of rank $m$, $\delta_{m+1} = \cdots = \delta_p = \sigma^2 > 0$.




In terms of the canonical forms the concentrated likelihood is equivalent to

(2.14)  $\log L^* = -\log|D_\delta| - \operatorname{tr}\,\Gamma D_\delta^{-1}\Gamma' W D_t W'$
        $= -\sum_{i=1}^{p} \log\delta_i - \operatorname{tr} D_\delta^{-1}(\Gamma'W)D_t(\Gamma'W)',$

which is to be maximized with respect to orthogonal $\Gamma$ and diagonal $D_\delta$ subject
to $\delta_1 \ge \cdots \ge \delta_m \ge \delta_{m+1} = \cdots = \delta_p > 0$. We use the following theorem of von
Neumann (1937).


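THEOREM (von Neumann). For $Q$ orthogonal and $D_\delta$ and $D_t$ diagonal
($\delta_1 \ge \cdots \ge \delta_p > 0$, $t_1 \ge \cdots \ge t_p > 0$)


(2.15)  $\min_Q \operatorname{tr} D_\delta^{-1} Q D_t Q' = \operatorname{tr} D_\delta^{-1} D_t,$

and a minimizing value of $Q$ is $Q = I$.


REMARK. For any multiplicities of the $\delta$'s and $t$'s the set of minimizing $Q$'s
is found from $\operatorname{tr} D_\delta^{-1} Q D_t Q' = \operatorname{tr} D_\delta^{-1} D_t$.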
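The inequality (2.15) can be checked numerically. The following is a minimal sketch, not part of the original argument: the values of `delta` and `t` are arbitrary illustrative choices, and the random orthogonal matrices are drawn from QR factorizations purely as a convenience.

```python
import numpy as np

def trace_objective(delta, t, Q):
    """tr(D_delta^{-1} Q D_t Q') for diagonal D_delta, D_t and a square Q."""
    return float(np.trace(np.diag(1.0 / delta) @ Q @ np.diag(t) @ Q.T))

def random_orthogonal(p, rng):
    """Orthogonal matrix taken from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
    return Q

rng = np.random.default_rng(0)
delta = np.array([5.0, 3.0, 2.0, 1.0])   # hypothetical delta_1 >= ... >= delta_p > 0
t = np.array([4.0, 2.5, 1.5, 0.5])       # hypothetical t_1 >= ... >= t_p > 0

lower_bound = float(np.sum(t / delta))   # right-hand side of (2.15)
values = [trace_objective(delta, t, random_orthogonal(4, rng)) for _ in range(200)]
```

None of the sampled traces should fall below `lower_bound`, and `Q = I` should attain it.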


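The maximum of (2.14) with respect to orthogonal $\Gamma'W$ (or orthogonal $\Gamma$) is

(2.16)  $-\sum_{i=1}^{p}\log\delta_i - \operatorname{tr} D_\delta^{-1}D_t = -\sum_{i=1}^{p}\left[\log\delta_i + \frac{t_i}{\delta_i}\right]$
        $= -\sum_{i=1}^{m}\left[\log\delta_i + \frac{t_i}{\delta_i}\right] - \left[q\log\sigma^2 + \frac{1}{\sigma^2}\sum_{i=m+1}^{p} t_i\right].$

The maximum of (2.16) with respect to $\delta_i$ is at $\hat\delta_i = t_i$, $i = 1,\dots,m$, and with
respect to $\sigma^2$ is at $\hat\sigma^2 = \sum_{i=m+1}^{p} t_i/q$; then $\hat\delta_{m+1} = \cdots = \hat\delta_p = \hat\sigma^2$.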

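Let

(2.17)  $D_t = \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}, \qquad W = (W_1, W_2),$

where $D_1$ is $m \times m$ and $W_1$ has $m$ columns. Then a maximizing $D_\delta$ and $Q = \Gamma'W$
are

(2.18)  $\hat{D}_\delta = \begin{pmatrix} D_1 & 0 \\ 0 & \hat\sigma^2 I_q \end{pmatrix}, \qquad \hat{Q} = \begin{pmatrix} I_m & 0 \\ 0 & Q_2 \end{pmatrix},$

where $Q_2$ is any orthogonal matrix of order $q$. Then

(2.19)  $\hat\Gamma = W\hat{Q}' = (W_1, W_2)\begin{pmatrix} I_m & 0 \\ 0 & Q_2' \end{pmatrix} = (W_1, W_2Q_2'),$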




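and the maximum likelihood estimator of $\sigma^2 I + \Theta$ is

(2.20)  $\hat\sigma^2 I_p + \hat\Theta = \hat\Gamma \hat{D}_\delta \hat\Gamma' = (W_1, W_2Q_2')\begin{pmatrix} D_1 & 0 \\ 0 & \hat\sigma^2 I_q \end{pmatrix}\begin{pmatrix} W_1' \\ Q_2W_2' \end{pmatrix} = W_1D_1W_1' + \hat\sigma^2 W_2W_2'.$

It follows that the maximum likelihood estimator of $\Theta$ is

(2.21)  $\hat\Theta = W_1D_1W_1' + \hat\sigma^2 W_2W_2' - \hat\sigma^2 I_p = W_1(D_1 - \hat\sigma^2 I_m)W_1',$

which is positive semidefinite of rank $m$.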
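The estimators (2.17) to (2.21) can be sketched in a few lines. This is an illustrative implementation only; the function name `mle_no_replication` and its interface are assumptions, and it presumes $0 \le m < p$ so that at least one root is left to estimate $\sigma^2$.

```python
import numpy as np

def mle_no_replication(x, m):
    """Sketch of (2.17)-(2.21) for X = mu + V + U with Cov(U) = sigma^2 I.

    x : (n, p) array of observations; m : assumed maximum rank of Theta (m < p).
    Returns (mu_hat, sigma2_hat, theta_hat).
    """
    n, p = x.shape
    mu_hat = x.mean(axis=0)
    centered = x - mu_hat
    C = centered.T @ centered / n                 # (2.5), divisor n
    t, W = np.linalg.eigh(C)                      # eigh returns ascending roots
    t, W = t[::-1], W[:, ::-1]                    # reorder so t_1 >= ... >= t_p
    sigma2_hat = t[m:].mean()                     # mean of the q = p - m smallest roots
    W1 = W[:, :m]
    theta_hat = W1 @ np.diag(t[:m] - sigma2_hat) @ W1.T   # (2.21)
    return mu_hat, sigma2_hat, theta_hat
```

Because the $m$ largest roots exceed the average of the remaining ones, the diagonal factor in (2.21) is positive and the returned `theta_hat` has rank `m`.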


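Since

(2.22)  $\operatorname{tr}(\hat\sigma^2 I_p + \hat\Theta)^{-1}C = \operatorname{tr}\left[W\begin{pmatrix} D_1 & 0 \\ 0 & \hat\sigma^2 I_q \end{pmatrix}W'\right]^{-1} W D_t W'$
        $= \operatorname{tr}\begin{pmatrix} D_1^{-1} & 0 \\ 0 & \hat\sigma^{-2} I_q \end{pmatrix}\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix} = \operatorname{tr}\begin{pmatrix} I_m & 0 \\ 0 & \hat\sigma^{-2} D_2 \end{pmatrix} = p,$

the maximized value of the likelihood function is

(2.23)  $L(m) = \left[(2\pi e)^{pn/2}\prod_{i=1}^{m} t_i^{n/2}\left(\sum_{i=m+1}^{p} t_i/q\right)^{nq/2}\right]^{-1}.$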


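The likelihood ratio criterion for testing the null hypothesis $H_0$: $m \le m_0$
against the alternative $m_0 < m \le m_1$, where $m_0$ and $m_1$ are specified integers
between 0 and $p$, is

(2.24)  $\frac{L(m_0)}{L(m_1)} = \frac{\prod_{i=1}^{m_1} t_i^{n/2}\left(\sum_{i=m_1+1}^{p} t_i/q_1\right)^{nq_1/2}}{\prod_{i=1}^{m_0} t_i^{n/2}\left(\sum_{i=m_0+1}^{p} t_i/q_0\right)^{nq_0/2}} = \frac{\prod_{i=m_0+1}^{m_1} t_i^{n/2}\left(\sum_{i=m_1+1}^{p} t_i/q_1\right)^{nq_1/2}}{\left(\sum_{i=m_0+1}^{p} t_i/q_0\right)^{nq_0/2}},$

where $q_0 = p - m_0$ and $q_1 = p - m_1$.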




A special case is $m_1 = p$, that is, the alternative is that the rank of $\Theta$ is
greater than $m_0$. The likelihood ratio criterion is

(2.25)  $\frac{L(m_0)}{L(p)} = \left[\frac{\left(\prod_{i=m_0+1}^{p} t_i\right)^{1/q_0}}{\sum_{i=m_0+1}^{p} t_i/q_0}\right]^{nq_0/2}.$

This is the $\tfrac12 nq_0$ power of the ratio of the geometric mean of the $q_0$ smallest
roots to the arithmetic mean of these roots.
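As a small illustration of (2.25), the logarithm of the criterion can be computed directly from the roots. The helper name `log_lr_rank` is an assumption made for this sketch; it works on the log scale to avoid overflow for large $n$.

```python
import numpy as np

def log_lr_rank(t, m0, n):
    """log of (2.25): (n*q0/2) * log(geometric mean / arithmetic mean) of the q0 smallest roots."""
    t = np.sort(np.asarray(t, dtype=float))[::-1]   # t_1 >= ... >= t_p
    tail = t[m0:]                                   # the q0 = p - m0 smallest roots
    q0 = tail.size
    log_gm = np.mean(np.log(tail))
    log_am = np.log(np.mean(tail))
    return 0.5 * n * q0 * (log_gm - log_am)
```

By the arithmetic-geometric mean inequality the returned value is never positive, and it equals zero exactly when the $q_0$ smallest roots coincide.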


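3. MANOVA with independent errors and replications. We now consider
the balanced MANOVA with $k$ replications in each cell:


(3.1)  $X_{\alpha j} = \mu + V_\alpha + U_{\alpha j}, \qquad j = 1,\dots,k, \quad \alpha = 1,\dots,n.$


The unobservable random vectors $V_1,\dots,V_n, U_{11},\dots,U_{nk}$ are independent with
means 0 and covariances


(3.2)  $\mathcal{E} V_\alpha V_\alpha' = \Theta, \qquad \mathcal{E} U_{\alpha j}U_{\alpha j}' = \Psi, \qquad j = 1,\dots,k, \quad \alpha = 1,\dots,n.$


We assume that the rank of $\Theta$ is less than or equal to $m$. The covariance matrix
of $X_\alpha^* = (X_{\alpha 1}',\dots,X_{\alpha k}')'$ is


(3.3)  $\mathcal{E}(X_\alpha^* - \varepsilon\otimes\mu)(X_\alpha^* - \varepsilon\otimes\mu)' = I\otimes\Psi + \varepsilon\varepsilon'\otimes\Theta,$

where $\varepsilon = (1,1,\dots,1)'$, $X_1^*,\dots,X_n^*$ are independently distributed, and $\otimes$ denotes
the Kronecker product. The inverse of (3.3) is

(3.4)  $\left(I - \frac{1}{k}\varepsilon\varepsilon'\right)\otimes\Psi^{-1} + \frac{1}{k}\varepsilon\varepsilon'\otimes(\Psi + k\Theta)^{-1}.$


The determinant of (3.3) is $|\Psi|^{k-1}\cdot|\Psi + k\Theta|$. Note that the covariance matrix of
$\bar{X}_\alpha = (1/k)\sum_{j=1}^{k} X_{\alpha j}$ is $(1/k)(\Psi + k\Theta)$.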
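The inverse (3.4) and the determinant of (3.3) are easy to confirm numerically. The sketch below uses arbitrary illustrative matrices; the dimensions `p`, `k` and the helper `random_spd` are assumptions for the example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 3, 4

def random_spd(p, rng):
    """Symmetric positive definite matrix used as a stand-in for Psi."""
    a = rng.standard_normal((p, p))
    return a @ a.T + p * np.eye(p)

Psi = random_spd(p, rng)
B = rng.standard_normal((p, 2))
Theta = B @ B.T                       # positive semidefinite of rank 2
eps = np.ones((k, 1))
J = eps @ eps.T                       # epsilon epsilon'

# Covariance (3.3) and the claimed inverse (3.4)
Sigma = np.kron(np.eye(k), Psi) + np.kron(J, Theta)
Sigma_inv = (np.kron(np.eye(k) - J / k, np.linalg.inv(Psi))
             + np.kron(J / k, np.linalg.inv(Psi + k * Theta)))

# log-determinant of (3.3) versus (k-1) log|Psi| + log|Psi + k Theta|
logdet = np.linalg.slogdet(Sigma)[1]
logdet_formula = ((k - 1) * np.linalg.slogdet(Psi)[1]
                  + np.linalg.slogdet(Psi + k * Theta)[1])
```

The check rests on $\varepsilon\varepsilon'/k$ being a projection, so that (3.3) splits into two orthogonal pieces with covariances $\Psi$ and $\Psi + k\Theta$.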

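If $x_{11},\dots,x_{1k}, x_{21},\dots,x_{nk}$ are the observation vectors, the logarithm of the
likelihood function is

(3.5)  $\log L = -\frac{pnk}{2}\log 2\pi - \frac{n(k-1)}{2}\log|\Psi| - \frac{n}{2}\log|\Psi + k\Theta|$
        $\quad - \frac{1}{2}\sum_{\alpha=1}^{n}(x_\alpha^* - \varepsilon\otimes\mu)'\left[\left(I - \frac{1}{k}\varepsilon\varepsilon'\right)\otimes\Psi^{-1} + \frac{1}{k}\varepsilon\varepsilon'\otimes(\Psi + k\Theta)^{-1}\right](x_\alpha^* - \varepsilon\otimes\mu)$
        $= -\frac{pnk}{2}\log 2\pi - \frac{n(k-1)}{2}\log|\Psi| - \frac{n}{2}\log|\Psi + k\Theta|$
        $\quad - \frac{1}{2}\left[\operatorname{tr} G\Psi^{-1} + \operatorname{tr} H(\Psi + k\Theta)^{-1} + nk(\bar{x} - \mu)'(\Psi + k\Theta)^{-1}(\bar{x} - \mu)\right],$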




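where

(3.6)  $H = k\sum_{\alpha=1}^{n}(\bar{x}_\alpha - \bar{x})(\bar{x}_\alpha - \bar{x})',$

(3.7)  $G = \sum_{\alpha=1}^{n}\sum_{j=1}^{k}(x_{\alpha j} - \bar{x}_\alpha)(x_{\alpha j} - \bar{x}_\alpha)',$

$\bar{x}_\alpha = (1/k)\sum_{j=1}^{k} x_{\alpha j}$ and $\bar{x} = (1/n)\sum_{\alpha=1}^{n}\bar{x}_\alpha$. A sufficient set of statistics consists
of $H$, $G$, and $\bar{x}$. The maximum of $\log L$ with respect to $\mu$ is at $\hat\mu = \bar{x}$.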

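First we consider the case of $\Psi = \sigma^2 I$. Let $(1/n)H = WD_tW'$ and $\sigma^2 I + k\Theta = \Gamma D_\delta\Gamma'$,
where again $W$ and $\Gamma$ are orthogonal and $D_t$ and $D_\delta$ are diagonal.
Then the concentrated likelihood is equivalent to

(3.8)  $-pn(k-1)\log\sigma^2 - n\sum_{i=1}^{p}\log\delta_i - \frac{1}{\sigma^2}\operatorname{tr} G - n\operatorname{tr} D_\delta^{-1}(\Gamma'W)D_t(\Gamma'W)'.$

The maximum of (3.8) with respect to orthogonal $\Gamma'W$ is

(3.9)  $-[pn(k-1) + nq]\log\sigma^2 - n\sum_{i=1}^{m}\log\delta_i - \frac{1}{\sigma^2}\left(\operatorname{tr} G + n\sum_{i=m+1}^{p} t_i\right) - n\sum_{i=1}^{m}\frac{t_i}{\delta_i}.$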


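We want to maximize (3.9) with respect to $\delta_1 \ge \cdots \ge \delta_m$ and $\sigma^2$ subject to
$\delta_j \ge \sigma^2$, $j = 1,\dots,m$. Note that the concentrated likelihood function is a strictly
concave function of $1/\sigma^2$ and $1/\delta_j$, $j = 1,\dots,m$, and hence the maximum is
unique. The derivatives of (3.9) with respect to $\delta_1,\dots,\delta_m$ and $\sigma^2$ are

(3.10)  $-\frac{n}{\delta_j} + \frac{nt_j}{\delta_j^2}, \qquad j = 1,\dots,m,$

(3.11)  $-\frac{pn(k-1) + nq}{\sigma^2} + \frac{\operatorname{tr} G + n\sum_{i=m+1}^{p} t_i}{(\sigma^2)^2}.$

Let $a = pn(k-1) + nq$ and $A = \operatorname{tr} G + n\sum_{i=m+1}^{p} t_i$. Let $m^* = m$ if $A/a < t_m$;
otherwise let $m^*$ be such that

(3.12)  $t_{m^*+1} + \frac{n}{a}\left[(m - m^* - 1)t_{m^*+1} - t_{m^*+2} - \cdots - t_m\right] \le \frac{A}{a} < t_{m^*} + \frac{n}{a}\left[(m - m^*)t_{m^*} - t_{m^*+1} - \cdots - t_m\right];$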




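that is, if

(3.13)  $[pn(k-1) + nq^*]\,t_{m^*+1} \le \operatorname{tr} G + n\sum_{i=m^*+1}^{p} t_i < [pn(k-1) + nq^*]\,t_{m^*},$

where $q^* = p - m^*$. Then $\hat\delta_i = t_i$, $i = 1,\dots,m^*$, and

(3.14)  $\hat\sigma^{*2} = \frac{\operatorname{tr} G + n\sum_{i=m^*+1}^{p} t_i}{pn(k-1) + nq^*}.$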
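One way to locate $m^*$ in practice is to scan downward from $m$ and stop at the first candidate whose variance estimate (3.14) lies below the corresponding root. This is a sketch under that assumption, not a procedure stated in the text; the function name is hypothetical. The left-hand inequality of (3.13) then holds automatically, because the previous candidate failed the right-hand one.

```python
import numpy as np

def select_rank_and_sigma2(t, tr_G, n, k, m):
    """Scan m' = m, m-1, ..., 0 and return the first (m', sigma2) with sigma2 < t_{m'}.

    t : roots of H/n; tr_G : trace of G; sigma2 follows (3.14) with q' = p - m'.
    """
    t = np.sort(np.asarray(t, dtype=float))[::-1]   # t_1 >= ... >= t_p
    p = t.size
    for m_star in range(m, -1, -1):
        q_star = p - m_star
        sigma2 = (tr_G + n * t[m_star:].sum()) / (p * n * (k - 1) + n * q_star)
        # t[m_star - 1] is t_{m*} in the 1-based notation of (3.13)
        if m_star == 0 or sigma2 < t[m_star - 1]:
            return m_star, sigma2
```

For example, with roots `[5, 3, 0.2, 0.1]`, `n = 10`, `k = 3`, `m = 3` and `tr_G = 80`, the candidate $m' = 3$ gives $81/90 \ge t_3$ and is rejected, while $m' = 2$ gives $83/100 < t_2$ and is accepted.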

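Let

(3.15)  $D_1^* = \operatorname{diag}(t_1,\dots,t_{m^*}), \qquad D_2^* = \operatorname{diag}(t_{m^*+1},\dots,t_p),$

(3.16)  $W = (W_1^*, W_2^*),$

where $W_1^*$ has $m^*$ columns. The maximizing $D_\delta$ and $Q = \Gamma'W$ are

(3.17)  $\hat{D}_\delta = \begin{pmatrix} D_1^* & 0 \\ 0 & \hat\sigma^{*2} I_{q^*} \end{pmatrix}, \qquad \hat{Q} = \begin{pmatrix} I_{m^*} & 0 \\ 0 & Q_2^* \end{pmatrix},$

where $Q_2^*$ is any orthogonal matrix of order $q^*$. Then

(3.18)  $\hat\Gamma = W\hat{Q}' = (W_1^*, W_2^*Q_2^{*\prime}),$

and the maximum likelihood estimator of $\sigma^2 I + k\Theta$ is

(3.19)  $\hat\sigma^{*2} I_p + k\hat\Theta = W_1^*D_1^*W_1^{*\prime} + \hat\sigma^{*2} W_2^*W_2^{*\prime}.$


The maximum likelihood estimator of $\Theta$ is

(3.20)  $\hat\Theta = \frac{1}{k}\left(W_1^*D_1^*W_1^{*\prime} + \hat\sigma^{*2}W_2^*W_2^{*\prime} - \hat\sigma^{*2}I_p\right) = \frac{1}{k}W_1^*(D_1^* - \hat\sigma^{*2}I_{m^*})W_1^{*\prime},$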

which is positive semidefinite of rank $m^*$. The maximized value of the likelihood
function is

(3.21)  $L(m^*) = \left[(2\pi)^{pnk/2}\prod_{i=1}^{m^*} t_i^{n/2}\,(\hat\sigma^{*2})^{n[p(k-1)+q^*]/2}\,e^{pnk/2}\right]^{-1}.$


The likelihood ratio criterion for testing the null hypothesis $H_0$: $m \le m_0$
against the alternative $m_0 < m \le m_1$ is

(3.22)  $\frac{L(m_0^*)}{L(m_1^*)} = \prod_{i=m_0^*+1}^{m_1^*} t_i^{n/2}\,\frac{(\hat\sigma_1^{*2})^{n[p(k-1)+q_1^*]/2}}{(\hat\sigma_0^{*2})^{n[p(k-1)+q_0^*]/2}},$

where $\hat\sigma_0^{*2}$ and $\hat\sigma_1^{*2}$ are given by (3.14) for $m = m_0^*$, $q = q_0^*$ and $m = m_1^*$,
$q = q_1^*$, respectively, for $m_0^* < m_1^*$ and is 1 if $m_0^* = m_1^*$.


4. MANOVA with replications. We now consider the model (3.1) with
$\mathcal{E} U_{\alpha j}U_{\alpha j}' = \Psi$ unrestricted. Let the roots of

(4.1)  $\left|\frac{1}{n}H - d\,\frac{1}{n(k-1)}G\right| = 0$

be $d_1 > \cdots > d_p > 0$; the roots are distinct and positive with probability 1.
Define $Z$ by

(4.2)  $H = nZD_dZ', \qquad G = n(k-1)ZZ',$

where $D_d = \operatorname{diag}(d_1,\dots,d_p)$. Then

(4.3)  $\frac{1}{n}H - \frac{1}{n(k-1)}G = ZD_dZ' - ZZ' = Z(D_d - I_p)Z'$

is an estimator of $k\Theta$.
Let the roots of

(4.4)  $|(\Psi + k\Theta) - \delta\Psi| = 0$

be $\delta_1 \ge \cdots \ge \delta_m \ge \delta_{m+1} = \cdots = \delta_p = 1$. Let diagonal $D_\delta$ and nonsingular $\Gamma$
be defined by

(4.5)  $\Psi + k\Theta = \Gamma D_\delta\Gamma', \qquad \Psi = \Gamma\Gamma'.$


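Then the log likelihood function concentrated with respect to $\hat\mu = \bar{x}$ is
$-\tfrac12 npk\log 2\pi$ plus

(4.6)  $-\frac{n(k-1)}{2}\log|\Gamma\Gamma'| - \frac{n}{2}\log|\Gamma D_\delta\Gamma'| - \frac{n(k-1)}{2}\operatorname{tr}(\Gamma\Gamma')^{-1}ZZ' - \frac{n}{2}\operatorname{tr}(\Gamma D_\delta\Gamma')^{-1}ZD_dZ'$
        $= -nk\log|\Gamma| - \frac{n}{2}\log|D_\delta| - \frac{n(k-1)}{2}\operatorname{tr}\Gamma^{-1}ZZ'\Gamma'^{-1} - \frac{n}{2}\operatorname{tr} D_\delta^{-1}\Gamma^{-1}ZD_dZ'\Gamma'^{-1}$
        $= -nk\log|\Gamma| - \frac{n}{2}\log|D_\delta| - \frac{n(k-1)}{2}\operatorname{tr}(\Gamma^{-1}ZD_d^{1/2})D_d^{-1}(\Gamma^{-1}ZD_d^{1/2})' - \frac{n}{2}\operatorname{tr} D_\delta^{-1}(\Gamma^{-1}ZD_d^{1/2})(\Gamma^{-1}ZD_d^{1/2})',$

which is to be maximized with respect to $D_\delta$ and $\Gamma$. We use the singular value
decomposition:

(4.7)  $\Gamma^{-1}ZD_d^{1/2} = PD_rQ,$

where $P$ and $Q$ are orthogonal, and $D_r$ is diagonal with $r_1 \ge r_2 \ge \cdots \ge r_p > 0$.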




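Then

(4.8)  $\Gamma^{-1} = PD_rQD_d^{-1/2}Z^{-1},$

(4.9)  $|\Gamma^{-1}| = |\Gamma|^{-1} = |D_r|\cdot|Z|^{-1}\cdot|D_d|^{-1/2}.$


The concentrated log likelihood function to be maximized with respect to $P$, $Q$,
$D_r$, and $D_\delta$ is a function of $Z$ and $D_d$ plus

(4.10)  $nk\log|D_r| - \frac{n}{2}\log|D_\delta| - \frac{n(k-1)}{2}\operatorname{tr} PD_rQD_d^{-1}(PD_rQ)' - \frac{n}{2}\operatorname{tr} D_\delta^{-1}PD_rQ(PD_rQ)'$
        $= nk\log|D_r| - \frac{n}{2}\log|D_\delta| - \frac{n(k-1)}{2}\operatorname{tr} PD_rQD_d^{-1}Q'D_rP' - \frac{n}{2}\operatorname{tr} D_\delta^{-1}PD_rQQ'D_rP'$
        $= nk\log|D_r| - \frac{n}{2}\log|D_\delta| - \frac{n(k-1)}{2}\operatorname{tr} D_r^2QD_d^{-1}Q' - \frac{n}{2}\operatorname{tr} D_\delta^{-1}PD_r^2P'.$


Von Neumann's theorem shows that the maximum of (4.10) with respect to $P$
and $Q$ is

(4.11)  $nk\log|D_r| - \frac{n}{2}\log|D_\delta| - \frac{n(k-1)}{2}\operatorname{tr} D_r^2D_d^{-1} - \frac{n}{2}\operatorname{tr} D_r^2D_\delta^{-1}$
        $= \frac{n}{2}\sum_{i=1}^{p}\left[k\log r_i^2 - \log\delta_i - (k-1)\frac{r_i^2}{d_i} - \frac{r_i^2}{\delta_i}\right]$
        $= \frac{n}{2}\sum_{i=1}^{p}\left[k\log r_i^2 - r_i^2\left(\frac{k-1}{d_i} + \frac{1}{\delta_i}\right) - \log\delta_i\right].$

Since

(4.12)  $\max_x\,[a\log x - bx] = a\log a - a\log b - a$

and occurs at $x = a/b$, the maximum of (4.11) with respect to $r_1^2,\dots,r_p^2$ is

(4.13)  $-\frac{n}{2}\sum_{i=1}^{p}\left[\log\delta_i + k\log\left(\frac{k-1}{d_i} + \frac{1}{\delta_i}\right) + k - k\log k\right]$

and occurs at $r_i^2 = k[(k-1)/d_i + 1/\delta_i]^{-1}$, $i = 1,\dots,p$. The maximum of (4.13)
with respect to $\delta_1,\dots,\delta_p$ over the region $\delta_1 \ge \cdots \ge \delta_m \ge \delta_{m+1} = \cdots = \delta_p = 1$
occurs at

(4.14)  $\hat\delta_i = d_i$ if $d_i > 1$, and $\hat\delta_i = 1$ if $d_i \le 1$, for $i = 1,\dots,m$;
        $\hat\delta_i = 1$ for $i = m+1,\dots,p$.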




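Let $p^*$ be the number of $d_i > 1$, $m^* = \min(m, p^*)$, and $q^* = p - m^*$. Then
$\hat\delta_i = d_i$, $i = 1,\dots,m^*$, and $\hat\delta_i = 1$, $i = m^*+1,\dots,p$. The maximum of (4.13) is

(4.15)  $\frac{n(p - m^*)k}{2}\log k - \frac{npk}{2} + \frac{n(k-1)}{2}\sum_{i=1}^{m^*}\log d_i + \frac{nk}{2}\sum_{i=m^*+1}^{p}\log d_i - \frac{nk}{2}\sum_{i=m^*+1}^{p}\log(k - 1 + d_i).$


When we go back to (4.10) we see that a maximizing $Q$ is $Q = I$ (unique except
for multiplication of each diagonal element by $-1$) and a maximizing $P$ is

(4.16)  $\hat{P} = \begin{pmatrix} I_{m^*} & 0 \\ 0 & P_2 \end{pmatrix},$

where $P_2$ is an arbitrary orthogonal matrix of order $q^* = p - m^*$. Let

(4.17)  $D_1^* = \operatorname{diag}(d_1,\dots,d_{m^*}), \qquad D_2^* = \operatorname{diag}(d_{m^*+1},\dots,d_p),$

(4.18)  $Z = (Z_1^*, Z_2^*),$

where $Z_1^*$ has $m^*$ columns. Then

(4.19)  $\hat{D}_\delta = \begin{pmatrix} D_1^* & 0 \\ 0 & I_{q^*} \end{pmatrix}, \qquad D_d = \begin{pmatrix} D_1^* & 0 \\ 0 & D_2^* \end{pmatrix}.$

In these terms

(4.20)  $\hat{D}_r^2 = k\left[(k-1)D_d^{-1} + \hat{D}_\delta^{-1}\right]^{-1}.$

From (4.8) we obtain

(4.21)  $\hat\Gamma = ZD_d^{1/2}\hat{D}_r^{-1}\hat{P}',$

from which we obtain

(4.22)  $\hat\Psi = \hat\Gamma\hat\Gamma' = ZD_d^{1/2}\hat{D}_r^{-2}D_d^{1/2}Z' = \frac{1}{k}(Z_1^*, Z_2^*)\begin{pmatrix} kI_{m^*} & 0 \\ 0 & (k-1)I_{q^*} + D_2^* \end{pmatrix}\begin{pmatrix} Z_1^{*\prime} \\ Z_2^{*\prime} \end{pmatrix}$
        $= Z_1^*Z_1^{*\prime} + \frac{k-1}{k}Z_2^*Z_2^{*\prime} + \frac{1}{k}Z_2^*D_2^*Z_2^{*\prime},$

(4.23)  $\hat\Psi + k\hat\Theta = \hat\Gamma\hat{D}_\delta\hat\Gamma' = ZD_d^{1/2}\hat{D}_r^{-1}\hat{P}'\hat{D}_\delta\hat{P}\hat{D}_r^{-1}D_d^{1/2}Z' = \frac{1}{k}(Z_1^*, Z_2^*)\begin{pmatrix} kD_1^* & 0 \\ 0 & (k-1)I_{q^*} + D_2^* \end{pmatrix}\begin{pmatrix} Z_1^{*\prime} \\ Z_2^{*\prime} \end{pmatrix}$
        $= Z_1^*D_1^*Z_1^{*\prime} + \frac{k-1}{k}Z_2^*Z_2^{*\prime} + \frac{1}{k}Z_2^*D_2^*Z_2^{*\prime}.$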


Subtraction of (4.22) from (4.23) and division by $k$ yields

(4.24)  $\hat\Theta = \frac{1}{k}Z_1^*(D_1^* - I_{m^*})Z_1^{*\prime},$

which is positive semidefinite with rank $m^*$.
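The estimators (4.22) and (4.24) can be sketched as follows. Obtaining $Z$ from a Cholesky factor of $G/[n(k-1)]$ followed by a symmetric eigendecomposition is one convenient choice made here, not something prescribed by the text, and the function name `mle_unrestricted` is an assumption.

```python
import numpy as np

def mle_unrestricted(H, G, n, k, m):
    """Sketch of Section 4: returns (d, Z, m_star, psi_hat, theta_hat).

    Z satisfies H/n = Z D_d Z' and G/(n(k-1)) = Z Z', as in (4.2).
    """
    A = H / n
    B = G / (n * (k - 1))
    L = np.linalg.cholesky(B)                              # B = L L'
    M = np.linalg.solve(L, np.linalg.solve(L, A).T).T      # L^{-1} A L^{-T}
    M = (M + M.T) / 2                                      # remove rounding asymmetry
    d, U = np.linalg.eigh(M)
    d, U = d[::-1], U[:, ::-1]                             # d_1 >= ... >= d_p
    Z = L @ U
    m_star = min(m, int(np.sum(d > 1)))                    # m* = min(m, p*)
    Z1, Z2 = Z[:, :m_star], Z[:, m_star:]
    d1, d2 = d[:m_star], d[m_star:]
    theta_hat = Z1 @ np.diag(d1 - 1.0) @ Z1.T / k          # (4.24)
    psi_hat = Z1 @ Z1.T + Z2 @ np.diag((k - 1 + d2) / k) @ Z2.T   # (4.22)
    return d, Z, m_star, psi_hat, theta_hat
```

Here `H` and `G` are the between-class and within-class sums of squares of (3.6) and (3.7); the returned `theta_hat` has rank `m_star` whenever the roots exceeding 1 are distinct.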
The maximized log likelihood function is

(4.25)  $\log L(m^*) = -\frac{npk}{2}\log 2\pi - nk\log|Z| - \frac{nk}{2}\log|D_d| + \frac{nq^*k}{2}\log k - \frac{npk}{2} + \frac{n(k-1)}{2}\sum_{i=1}^{m^*}\log d_i$
        $\quad + \frac{nk}{2}\sum_{i=m^*+1}^{p}\log d_i - \frac{nk}{2}\sum_{i=m^*+1}^{p}\log(k - 1 + d_i)$
        $= -\frac{npk}{2}\log 2\pi + \frac{nq^*k}{2}\log k - \frac{npk}{2} - nk\log|Z| - \frac{n}{2}\sum_{i=1}^{m^*}\log d_i - \frac{nk}{2}\sum_{i=m^*+1}^{p}\log(k - 1 + d_i).$


This expression agrees with the substitution of $\hat\Psi$ and $\hat\Psi + k\hat\Theta$ into (3.5).

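Any matrix $\hat{B}$ satisfying $\hat{B}\hat\Theta = 0$ is a maximum likelihood estimator of $B$. In
particular, $Y$, which consists of the last $q^*$ rows of $(Z')^{-1}$, has the required
property. Thus $\hat{B}$ is any nonsingular multiple of $Y$.

The likelihood ratio criterion for testing the null hypothesis $H_0$: $m \le m_0$
against the alternative $m_0 < m \le m_1$ is

$\frac{L(m_0^*)}{L(m_1^*)} = \frac{k^{nk(m_1^* - m_0^*)/2}\prod_{i=m_0^*+1}^{m_1^*} d_i^{n/2}}{\prod_{i=m_0^*+1}^{m_1^*}(k - 1 + d_i)^{nk/2}} = \prod_{i=m_0^*+1}^{m_1^*}\left[\frac{k^k d_i}{(k - 1 + d_i)^k}\right]^{n/2}$

if $m_0^* < m_1^*$ and is 1 if $m_0^* = m_1^*$.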


REFERENCES 


AMEMIYA, Y. (1985). What should be done when an estimated between-group covariance matrix is
not nonnegative definite? Amer. Statist. 39 112-117.

AMEMIYA, Y. and FULLER, W. (1984). Estimation for the multivariate errors-in-variables model with
estimated error covariance matrix. Ann. Statist. 12 497-509.

ANDERSON, T. W. (1946). Analysis of multivariate variance. Unpublished manuscript.

ANDERSON, T. W. (1984). Estimating linear statistical relationships (The 1982 Wald Memorial
Lectures). Ann. Statist. 12 1-45.

KLOTZ, J. and PUTTER, J. (1969). Maximum likelihood estimation of the multivariate covariance
components for the balanced one-way layout. Ann. Math. Statist. 40 1100-1105.




Morris (ANDERSON), B. and OLKIN, I. (1964). Some estimation and testing problems for factor 
analysis models. Unpublished manuscript. 

VON NEUMANN, J. (1937). Some matrix-inequalities and metrization of matrix space. Tomsk University
Review 1 283-300. Reprinted (1962) in John von Neumann Collected Works (A. H.
Taub, ed.) 4 205-219. Pergamon, New York.

RAO, C. R. (1983). Likelihood ratio tests for relationships between two covariance matrices. In
Studies in Econometrics, Time Series, and Multivariate Statistics (S. Karlin, T. Amemiya,
and L. A. Goodman, eds.) 529-543. Academic, New York.

SCHOTT, J. R. and Saw, J. G. (1984). A multivariate one-way classification model with random 
effects. J. Multivariate Anal. 15 1-12. 

THEOBALD, C. M. (1975). An inequality with application to multivariate analysis. Biometrika 62 
461-466. 


DEPARTMENT OF STATISTICS 
STANFORD UNIVERSITY 
STANFORD, CALIFORNIA 94305 


The Annals of Statistics
1986, Vol. 14, No. 2, 418-430


ASYMPTOTIC THEORY FOR COMMON PRINCIPAL
COMPONENT ANALYSIS¹


By BERNARD N. FLURY 
University of Berne 

Under the common principal component model $k$ covariance matrices
$\Sigma_1,\dots,\Sigma_k$ are simultaneously diagonalizable, i.e., there exists an orthogonal
matrix $\beta$ such that $\beta'\Sigma_i\beta = \Lambda_i$ is diagonal for $i = 1,\dots,k$. In this article we
give the asymptotic distribution of the maximum likelihood estimates of $\beta$
and $\Lambda_i$. Using these results, we derive tests for (a) equality of eigenvectors
with a given set of orthonormal vectors, and (b) redundancy of $p - q$ (out of
$p$) principal components. The likelihood-ratio test for simultaneous sphericity
of $p - q$ principal components in $k$ populations is derived, and some of the
results are illustrated by a biometrical example.


1. Introduction. Common principal component analysis (CPCA) is a generalization
of principal component analysis (PCA) to $k$ groups (Flury, 1984). The
key assumption is that the $p \times p$ covariance matrices $\Sigma_1,\dots,\Sigma_k$ of $k$ populations
can be diagonalized by the same orthogonal transformation, i.e., there exists
an orthogonal matrix $\beta$ such that


(1.1)  $H_C$: $\beta'\Sigma_i\beta = \Lambda_i$ (diagonal) $\quad (i = 1,\dots,k)$


holds. $H_C$ is called the hypothesis of common principal components (CPC's).
Flury (1984) derives the normal theory maximum likelihood estimates of $\beta$ and
$\Lambda_i$ and gives numerical examples.

In the one sample case $k = 1$, CPC's reduce to ordinary principal components
(PC's). In this case the ML estimates of $\beta$ and $\Lambda = \Lambda_1$ are the eigenvectors and
eigenvalues of a Wishart matrix $S_1$. The asymptotic distribution theory for this
situation has been developed by Girshick (1939), Lawley (1953, 1956) and
Anderson (1963). The present paper gives essentially generalizations of results
obtained by Anderson.

In one-group PCA, the eigenvectors $\beta_j$ forming the orthogonal matrix $\beta =
(\beta_1,\dots,\beta_p)$ are usually ordered according to the associated eigenvalues $\lambda_1 >
\lambda_2 > \cdots > \lambda_p$. In CPCA no obvious fixed order of the columns of $\beta$ need be
given, since the rank order of the diagonal elements of the $\Lambda_i$ is not necessarily
the same for all $\Lambda_i$. However, we can use some convention, e.g., that the columns
of $\beta$ be arranged according to the first group, i.e., such that $\beta_1'\Sigma_1\beta_1 > \beta_2'\Sigma_1\beta_2 >$


Received September 1984; revised July 1986. 

¹This work was done under contract 82.008.0.82 of the Swiss National Science Foundation at
Stanford University. Partial support was also provided by NSF grant DMS 84-11411.

AMS 1980 subject classifications. Primary 62H25; secondary 62H15, 62E20.

Key words and phrases. Maximum likelihood, covariance matrices, eigenvectors, eigenvalues. 




PRINCIPAL COMPONENT ANALYSIS 419 


$\cdots > \beta_p'\Sigma_1\beta_p$ (assuming that the $p$ characteristic roots of $\Sigma_1$ are all distinct).
This will enable us to speak about first, second, or last principal components also
in the $k$-group case.

Tests for various hypotheses about B and A in PCA have been proposed by 
Anderson (1963). We are going to construct analogous tests for CPCA. More 
specifically, we will treat the following problems: 


1. Is the $j$th eigenvector $\beta_j$ identical with a given vector $\beta_j^0$? More
generally, for $q$ different eigenvectors $\beta_j,\dots,\beta_{j+q-1}$, are they identical
with $q$ given (orthonormal) eigenvectors $\beta_j^0,\dots,\beta_{j+q-1}^0$? This problem will be
treated in Section 3.

2. As an associate editor handling the previous paper (Flury, 1984) has pointed 
out, the most useful applications of CPCA would probably be those in which 
some relatively small number q of rotated axes are sufficient to recover most of 
the variability in each of the k groups. It is therefore useful to have a criterion 
for neglecting CPC’s with small contributions. A solution to this problem is 
given in Section 4.1. 

3. When PC's are interpreted, it is important to make sure that the roots $\lambda_j$ and
$\lambda_l$ (say) are not identical, because otherwise the associated eigenvectors $\beta_j$ and
$\beta_l$ are not uniquely defined. Similarly in CPCA two eigenvectors $\beta_j$ and $\beta_l$ are
uniquely defined if in at least one population the two associated eigenvalues
are not identical. A likelihood ratio test dealing with this problem is given in
Section 4.2.


We will from now on always assume that the matrices $\Sigma_1,\dots,\Sigma_k$ are positive
definite symmetric (p.d.s.). The diagonal elements of $\Lambda_i$ will be denoted by $\lambda_{ij}$,
i.e., $\Lambda_i = \operatorname{diag}(\lambda_{i1},\dots,\lambda_{ip})$ $(i = 1,\dots,k)$. All results will be based on $k$ independent
sample covariance matrices $S_i$ with $n_i$ degrees of freedom, respectively, such
that $n_iS_i$ has the Wishart distribution $W_p(n_i, \Sigma_i)$. The ML estimates of $\beta =
(\beta_1,\dots,\beta_p)$ and $\Lambda_i$ are denoted by $\hat\beta = (\hat\beta_1,\dots,\hat\beta_p)$ and $\hat\Lambda_i = \operatorname{diag}(\hat\lambda_{i1},\dots,\hat\lambda_{ip})$.


2. Asymptotic distribution of the maximum likelihood estimates. In
this section we are using general properties of ML-estimates under regularity
conditions; see, e.g., Silvey (1975, Chapters 4 and 7) and Wilks (1944, Chapter 6).
In particular we will use the fact that the joint asymptotic distribution of the
parameter estimates is multivariate normal, the covariance matrix being given by
the inverse of the Fisher information matrix. The log-likelihood function of the $k$
samples, up to an additive constant, is given by


(2.1)  $g(\Lambda_1,\dots,\Lambda_k, \beta \mid S_1,\dots,S_k) = -\frac{1}{2}\sum_{i=1}^{k} n_i\sum_{j=1}^{p}\left(\log\lambda_{ij} + \beta_j'S_i\beta_j/\lambda_{ij}\right)$

(Flury 1984, formula 2.5). Assume that the $\beta_j$ are well defined, i.e., for each pair
$(j, l)$ there is at least one $i \in \{1,\dots,k\}$ such that $\lambda_{ij} \ne \lambda_{il}$. Let $\lambda_i =
(\lambda_{i1},\dots,\lambda_{ip})'$, $s = p(p - 1)/2$, and denote by $\beta^*$ a vector composed of $s$ functionally
independent elements of $\beta$. Put $n = n_1 + \cdots + n_k$ and $r_i = n_i/n$ ($i =$


420 B. N FLURY 


1,..., k). Then the information matrix is 


(2.2) 





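where $A$ and $G$ are not yet determined.
Since $\hat\lambda_{ij} = \hat\beta_j'S_i\hat\beta_j$ and $\hat\beta_j$ is a consistent estimate of $\beta_j$, we can use the
asymptotic ($n_i \to \infty$) normality of $n_iS_i$ (Muirhead, 1982, page 19) to get the
asymptotic univariate distribution of $\hat\lambda_{ij}$ as


(2.3)  $\sqrt{n_i}\,(\hat\lambda_{ij} - \lambda_{ij}) \sim N(0, 2\lambda_{ij}^2).$

From (2.2), the joint asymptotic distribution of $(\hat\lambda_1',\dots,\hat\lambda_k')'$ has covariance
matrix

(2.4)  $\frac{1}{n}V_\lambda = \frac{1}{n}\left[\begin{pmatrix} \tfrac12 r_1\Lambda_1^{-2} & & 0 \\ & \ddots & \\ 0 & & \tfrac12 r_k\Lambda_k^{-2} \end{pmatrix} - G'A^{-1}G\right]^{-1}.$


Since, by (2.3), the diagonal elements of $V_\lambda$ are $(2\lambda_{11}^2/r_1, 2\lambda_{12}^2/r_1,\dots,2\lambda_{kp}^2/r_k)$
and $A$ is p.d.s., it follows that $G = 0$. Thus we get:


THEOREM 1. The statistics $\sqrt{n_i}\,(\hat\lambda_{ij} - \lambda_{ij})$ are asymptotically ($\min_{1\le i\le k} n_i \to \infty$)
distributed as $N(0, 2\lambda_{ij}^2)$, independent of each other and independent of $\hat\beta$.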


The asymptotic distribution of $\hat\beta$ requires more work. First, from the log-likelihood
function (2.1) it is clear that the matrix $A$ can be written as the sum of $k$
matrices $A_1,\dots,A_k$, where $A_i$ is associated with the $i$th sample. Let $V_i$ denote
the asymptotic covariance matrix of $\sqrt{n_i}\operatorname{vec}\hat\beta = \sqrt{n_i}\,(\hat\beta_1',\dots,\hat\beta_p')'$ as obtained
from the $i$th sample alone. Following Anderson (1963, page 130), and writing


(2.5)  $g_{jl}^{(i)} = \frac{\lambda_{ij}\lambda_{il}}{(\lambda_{ij} - \lambda_{il})^2},$




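we get

(2.6)  $V_i = \begin{pmatrix} \sum_{h\ne 1} g_{1h}^{(i)}\beta_h\beta_h' & -g_{12}^{(i)}\beta_2\beta_1' & \cdots & -g_{1p}^{(i)}\beta_p\beta_1' \\ -g_{21}^{(i)}\beta_1\beta_2' & \sum_{h\ne 2} g_{2h}^{(i)}\beta_h\beta_h' & \cdots & -g_{2p}^{(i)}\beta_p\beta_2' \\ \vdots & \vdots & & \vdots \\ -g_{p1}^{(i)}\beta_1\beta_p' & -g_{p2}^{(i)}\beta_2\beta_p' & \cdots & \sum_{h\ne p} g_{ph}^{(i)}\beta_h\beta_h' \end{pmatrix},$

where the block rows and columns correspond to $\hat\beta_1,\dots,\hat\beta_p$.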


LEMMA 1. The matrices $V_i$ ($i = 1,\dots,k$) as given by (2.6) are simultaneously
diagonalizable.


This lemma is easily proved by showing that $V_iV_h = V_hV_i$ for all pairs $(i, h)$,
using the equivalence of simultaneous diagonalizability and commutativity under
multiplication.

Thus there exists an orthogonal $p^2 \times p^2$ matrix $H = (H_1, H_2)$, where $H_1 =
(h_1,\dots,h_s)$, $s = p(p - 1)/2$, such that


(2.7)  $H'V_iH = \begin{pmatrix} E_i & 0 \\ 0 & 0 \end{pmatrix} \qquad (i = 1,\dots,k),$


where $E_i = \operatorname{diag}(e_{i1},\dots,e_{is})$, $e_{ij} > 0$. ($E_i$ has rank $s$ because there are $s$ functionally
independent elements in $\beta$.) For the transformed variables $u = H_1'\operatorname{vec}\hat\beta$
the information from the $i$th sample is therefore


(2.8)  $A_i^* = n_i\operatorname{diag}(e_{i1}^{-1}, e_{i2}^{-1},\dots,e_{is}^{-1})$

with

$e_{ij} = h_j'V_ih_j.$


The sum of these $k$ information matrices is

(2.9)  $A^* = \sum_{i=1}^{k} A_i^* = n\operatorname{diag}\left(\sum_{i=1}^{k} r_ie_{i1}^{-1},\dots,\sum_{i=1}^{k} r_ie_{is}^{-1}\right) = n\operatorname{diag}(e_1^{-1},\dots,e_s^{-1}),$


where $e_j$ is the harmonic mean of $e_{1j},\dots,e_{kj}$ (with weights $r_i$). The asymptotic covariance matrix
of $u$ is therefore $\operatorname{diag}(e_1,\dots,e_s)/n$. Transforming back to $\hat\beta$, we get the




asymptotic covariance matrix of $\operatorname{vec}\hat\beta$ as


(2.10)  $\frac{1}{n}\sum_{j=1}^{s} e_jh_jh_j'.$


To establish the final result, we need now the explicit form of $H_1$. The $h_j$ are
vectors of dimension $p^2$. If $\eta = (\eta_1',\dots,\eta_p')'$ is a $p^2$ vector partitioned into $p$
vectors of dimension $p$, we will, for simplicity, refer to $\eta_j$ as the $j$th position of $\eta$
(which corresponds to the scalar positions $(j - 1)p + 1$ through $jp$).


LEMMA 2. The $s = \binom{p}{2}$ normalized characteristic vectors of $V_i$ associated
with positive roots are as follows: For each pair of indices $(j, l)$, with $1 \le j <
l \le p$, there exists a characteristic vector having $\beta_l/\sqrt{2}$ in position $j$ and $-\beta_j/\sqrt{2}$
in position $l$. All other positions are zero, and the associated eigenvalues are
$2g_{jl}^{(i)}$.


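The proof of Lemma 2 is straightforward and need not be given. The eigenvectors
defined in the lemma form the matrix $H_1$. Assuming that all $k$ matrices $\Sigma_i$
have $p$ distinct eigenvalues and that $r_i > 0$, the $g_{jl}^{(i)}$ are all positive. From (2.8) it
is now seen that $e_{ih} = 2g_{jl}^{(i)}$ for some pair $(j, l)$. Thus, putting


(2.11)  $g_{jl} = \left[\sum_{i=1}^{k}\frac{r_i}{g_{jl}^{(i)}}\right]^{-1}$


and writing $h_{jl}$ for the eigenvector associated with the roots $2g_{jl}^{(i)}$, we get the
asymptotic covariance matrix of $\sqrt{n}\operatorname{vec}\hat\beta$ as


(2.12)  $V = 2\sum_{j<l} g_{jl}h_{jl}h_{jl}'.$

Writing this in terms of the $\beta$-vectors, using Lemma 2, we get therefore:


THEOREM 2. The asymptotic distribution of $\sqrt{n}\operatorname{vec}(\hat\beta - \beta)$ is normal with
mean 0 and covariance matrix $V$ given by

(2.13)  $V = \begin{pmatrix} \sum_{h\ne 1} g_{1h}\beta_h\beta_h' & -g_{12}\beta_2\beta_1' & \cdots & -g_{1p}\beta_p\beta_1' \\ -g_{21}\beta_1\beta_2' & \sum_{h\ne 2} g_{2h}\beta_h\beta_h' & \cdots & -g_{2p}\beta_p\beta_2' \\ \vdots & \vdots & & \vdots \\ -g_{p1}\beta_1\beta_p' & -g_{p2}\beta_2\beta_p' & \cdots & \sum_{h\ne p} g_{ph}\beta_h\beta_h' \end{pmatrix},$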



where the $g_{jl}$ are defined in (2.11), and the $\beta_j$ are the (common) eigenvectors of
the $k$ matrices $\Sigma_i$.
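The structure of $V$ can be explored numerically. The sketch below assembles $V$ block by block and computes pooled weights as the inverse of $\sum_i r_i(\lambda_{ij} - \lambda_{il})^2/(\lambda_{ij}\lambda_{il})$, which is the reading of the weights assumed here; the function names `pooled_g` and `cpc_cov` are hypothetical, and the code presumes well-defined CPC's so that no pooled weight is infinite.

```python
import numpy as np

def pooled_g(lambdas, r):
    """g_jl = [sum_i r_i (lam_ij - lam_il)^2 / (lam_ij lam_il)]^{-1}; diagonal set to 0."""
    lam = np.asarray(lambdas, dtype=float)            # shape (k, p)
    r = np.asarray(r, dtype=float)
    diff2 = (lam[:, :, None] - lam[:, None, :]) ** 2
    prod = lam[:, :, None] * lam[:, None, :]
    inv_g = np.einsum("i,ijl->jl", r, diff2 / prod)
    g = np.zeros_like(inv_g)
    off = ~np.eye(lam.shape[1], dtype=bool)
    g[off] = 1.0 / inv_g[off]
    return g

def cpc_cov(beta, g):
    """Assemble the p^2 x p^2 matrix V from eigenvectors beta (columns) and weights g."""
    p = beta.shape[0]
    V = np.zeros((p * p, p * p))
    for j in range(p):
        for l in range(p):
            rows = slice(j * p, (j + 1) * p)
            cols = slice(l * p, (l + 1) * p)
            if j == l:
                # diagonal block: sum over h != j of g_jh * beta_h beta_h'
                V[rows, cols] = sum(g[j, h] * np.outer(beta[:, h], beta[:, h])
                                    for h in range(p) if h != j)
            else:
                # off-diagonal block: -g_jl * beta_l beta_j'
                V[rows, cols] = -g[j, l] * np.outer(beta[:, l], beta[:, j])
    return V
```

A vector with $\beta_l/\sqrt{2}$ in position $j$ and $-\beta_j/\sqrt{2}$ in position $l$ should then be an eigenvector of the assembled matrix with root $2g_{jl}$, and the matrix should have rank $p(p-1)/2$.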


The foregoing proof of Theorem 2 is based on the assumption that all $k$
matrices $\Sigma_i$ have $p$ distinct eigenvalues. However, since
$g_{jl}^{-1} = \sum_i r_i (\lambda_{ij} - \lambda_{il})^2 / (\lambda_{ij}\lambda_{il})$, we can take
$(g_{jl}^{(i)})^{-1} = 0$ if $\lambda_{ij} = \lambda_{il}$. In order for $g_{jl}$ to be defined it
suffices to have at least one $\Sigma_i$ with $\lambda_{ij} \ne \lambda_{il}$. It is therefore assumed that
Theorem 2 holds whenever CPC's are well defined.
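As a numerical check of Lemma 2 and Theorem 2, the following sketch assembles $V$ block by block as in (2.13) and confirms that the vectors described in Lemma 2 are eigenvectors with roots $2g_{jl}$. The function names, the eigenvalues, the weights $r_i$, and the random orthogonal matrix are illustrative assumptions, as is the reading $g_{jl}^{-1} = \sum_i r_i(\lambda_{ij}-\lambda_{il})^2/(\lambda_{ij}\lambda_{il})$ taken from the remark above.

```python
import numpy as np

def g_weights(lams, r):
    """g[j, l] = 1 / sum_i r_i (lam_ij - lam_il)^2 / (lam_ij * lam_il), j != l.

    lams: (k, p) eigenvalues, r: (k,) sample-size fractions (both hypothetical).
    Assumes every pair (j, l) has distinct eigenvalues in at least one group.
    """
    lams = np.asarray(lams, dtype=float)
    r = np.asarray(r, dtype=float)
    diff2 = (lams[:, :, None] - lams[:, None, :]) ** 2
    prod = lams[:, :, None] * lams[:, None, :]
    inv_g = np.einsum("i,ijl->jl", r, diff2 / prod)
    g = np.zeros_like(inv_g)
    off = ~np.eye(lams.shape[1], dtype=bool)
    g[off] = 1.0 / inv_g[off]
    return g

def cov_vec_b(B, g):
    """Assemble the p^2 x p^2 matrix V of (2.13) from its p x p blocks."""
    p = B.shape[0]
    V = np.zeros((p * p, p * p))
    for j in range(p):
        for l in range(p):
            if j == l:
                block = sum(g[j, h] * np.outer(B[:, h], B[:, h])
                            for h in range(p) if h != j)
            else:
                block = -g[j, l] * np.outer(B[:, l], B[:, j])
            V[j * p:(j + 1) * p, l * p:(l + 1) * p] = block
    return V

def lemma2_vector(B, j, l):
    """beta_l/sqrt(2) in position j, -beta_j/sqrt(2) in position l, zeros elsewhere."""
    p = B.shape[0]
    h = np.zeros(p * p)
    h[j * p:(j + 1) * p] = B[:, l] / np.sqrt(2)
    h[l * p:(l + 1) * p] = -B[:, j] / np.sqrt(2)
    return h

# Hypothetical inputs: k = 2 groups, p = 3 variables.
rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # arbitrary orthogonal matrix
lams = np.array([[5.0, 2.0, 1.0], [9.0, 3.0, 0.5]])
g = g_weights(lams, r=np.array([0.4, 0.6]))
V = cov_vec_b(B, g)
h01 = lemma2_vector(B, 0, 1)
print(np.allclose(V @ h01, 2 * g[0, 1] * h01))     # eigen-equation for the pair (1, 2)
```

The same check can be repeated for every pair $j < l$; the rank of $V$ then equals the number of pairs, $p(p-1)/2$.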


3. An asymptotic test for q hypothetical eigenvectors. Using the
asymptotic distribution theory, Anderson (1963, Appendix B) constructs a test
for the hypothesis that the $j$th eigenvector of $\Sigma$ is identical with a specified
vector $\beta_j^0$ ($\beta_j^{0\prime}\beta_j^0 = 1$), under the assumption that this eigenvector corresponds to
a root of multiplicity 1. In this section we are going to generalize Anderson's
result in two ways by deriving an analogous test for $q$ specified vectors ($1 \le q \le p$)
and $k$ groups. Without loss of generality we can order the CPCs such that the $q$
vectors to be tested are labeled 1 through $q$. The null hypothesis is

$$H_q: (\beta_1, \ldots, \beta_q) = (\beta_1^0, \ldots, \beta_q^0), \tag{3.1}$$

where the $\beta_j^0$ are specified, mutually orthogonal and have unit length.

The test of $H_q$ will be based on the asymptotic covariance matrix of
$\operatorname{vec}(\hat\beta_1, \ldots, \hat\beta_q)$, that is, the upper left $pq \times pq$ portion of $V$. Call this submatrix
$V(q)$. The eigenstructure of $V(q)$ is given by the following theorem, which is
actually a generalization of Lemma 2. We are again using the convention that the
$p$ scalar elements in positions $(j - 1)p + 1$ through $jp$ of a vector are referred to
as the $j$th position.


THEOREM 3. The upper left $pq \times pq$ submatrix of $V$ has the following
eigenvectors and eigenvalues:


1. $\binom{q}{2}$ eigenvectors (one for each pair $j, l$ with $1 \le j < l \le q$) have $\beta_l/\sqrt{2}$ in
position $j$ and $-\beta_j/\sqrt{2}$ in position $l$. All other positions are zero, and the
associated roots are $2g_{jl}$.

2. $(p - q)q$ eigenvectors (one for each combination of indices $j, l$ such that
$1 \le j \le q < l \le p$) have $\beta_l$ in position $j$ and 0 in all other positions; the
associated roots are $g_{jl}$.

3. $\binom{q}{2}$ eigenvectors (one for each pair of indices $j, l$ such that $1 \le j < l \le q$) can
be chosen to have $\beta_l/\sqrt{2}$ in position $j$, $\beta_j/\sqrt{2}$ in position $l$, and zeros in all
other positions. The associated roots are zero.

4. $q$ eigenvectors (one for each $j$ with $1 \le j \le q$) can be chosen to have $\beta_j$ in
position $j$ and zeros elsewhere. The associated roots are zero.


The proof of Theorem 3 is easy and is therefore omitted. We see that if $g_{jl} > 0$
for $1 \le j \le q$, $1 \le l \le p$, then $V(q)$ has rank $t = \binom{q}{2} + q(p - q)$.
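The eigenvalue count in Theorem 3 can be checked numerically. The sketch below builds $V$ from the spectral form (2.12), extracts the upper left $pq \times pq$ block, and compares its nonzero roots with the list $\{2g_{jl}: j < l \le q\} \cup \{g_{jl}: j \le q < l\}$. The dimensions, the symmetric array of $g_{jl}$ values, and the helper names are hypothetical choices made only for illustration.

```python
import numpy as np
from itertools import combinations

def v_from_lemma2(B, g):
    """V = 2 * sum_{j<l} g_jl h_jl h_jl', with h_jl as described in Lemma 2."""
    p = B.shape[0]
    V = np.zeros((p * p, p * p))
    for j, l in combinations(range(p), 2):
        h = np.zeros(p * p)
        h[j * p:(j + 1) * p] = B[:, l] / np.sqrt(2)
        h[l * p:(l + 1) * p] = -B[:, j] / np.sqrt(2)
        V += 2.0 * g[j, l] * np.outer(h, h)
    return V

def nonzero_roots(M, tol=1e-10):
    """Sorted eigenvalues of a symmetric matrix that exceed a small tolerance."""
    w = np.linalg.eigvalsh(M)
    return np.sort(w[w > tol])

# Hypothetical example: p = 4 variables, q = 2 vectors under test.
p, q = 4, 2
rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.standard_normal((p, p)))   # arbitrary orthogonal matrix
g = np.array([[0.0, 0.5, 0.3, 0.2],
              [0.5, 0.0, 0.7, 0.1],
              [0.3, 0.7, 0.0, 0.4],
              [0.2, 0.1, 0.4, 0.0]])                # made-up symmetric g_jl values

Vq = v_from_lemma2(B, g)[:p * q, :p * q]
roots = nonzero_roots(Vq)
expected = sorted([2 * g[j, l] for j, l in combinations(range(q), 2)]
                  + [g[j, l] for j in range(q) for l in range(q, p)])
t = q * (q - 1) // 2 + q * (p - q)                  # rank claimed after Theorem 3
print(len(roots) == t, np.allclose(roots, expected))
```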


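Let now $\Phi$ denote a $t \times t$ diagonal matrix with diagonal elements equal to the
nonzero roots of $V(q)$, i.e., $\Phi = \operatorname{diag}(2g_{12}, \ldots, 2g_{q-1,q}, g_{1,q+1}, \ldots, g_{qp})$, and let
the columns of the $pq \times t$ matrix $\Gamma$ be given by the characteristic vectors
associated with the nonzero roots. Putting $b_q = \operatorname{vec}(\hat\beta_1, \ldots, \hat\beta_q) - \operatorname{vec}(\beta_1, \ldots, \beta_q)$,
the random vector $z = \sqrt{n}\,\Phi^{-1/2}\Gamma' b_q$ has a limiting normal distribution with
mean zero and covariance matrix $I_t$. Thus $z'z = n b_q' \Gamma\Phi^{-1}\Gamma' b_q$ has a limiting chi
square distribution with $t$ degrees of freedom. Using Theorem 3, this expression
can be written as

$$\begin{aligned}
z'z &= n \begin{pmatrix} \hat\beta_1 - \beta_1 \\ \vdots \\ \hat\beta_q - \beta_q \end{pmatrix}' \Gamma\Phi^{-1}\Gamma' \begin{pmatrix} \hat\beta_1 - \beta_1 \\ \vdots \\ \hat\beta_q - \beta_q \end{pmatrix} \\
&= n \left[ \frac{1}{4} \sum_{j=1}^{q-1} \sum_{l=j+1}^{q} g_{jl}^{-1} \left( \beta_l'\hat\beta_j - \beta_j'\hat\beta_l \right)^2
 + \sum_{j=1}^{q} \sum_{l=q+1}^{p} g_{jl}^{-1} \left( \beta_l'\hat\beta_j \right)^2 \right].
\end{aligned} \tag{3.2}$$

In practical applications the asymptotic covariance matrix $V(q)$ is not known,
but can be consistently estimated by substituting the $\hat\lambda_{ij}$ and $\hat\beta_j$ for the
respective parameters. This does not affect the validity of the asymptotic chi
square approximation. Using $\hat\beta_j$ instead of $\beta_j$ and
$\hat g_{jl}^{-1} = \sum_i r_i (\hat\lambda_{ij} - \hat\lambda_{il})^2 / (\hat\lambda_{ij}\hat\lambda_{il})$
in the expression $\Gamma\Phi^{-1}\Gamma'$ of (3.2) we get therefore: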


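THEOREM 4. Under $H_q$ as defined in (3.1), the statistic

$$X^2(H_q) = n \left[ \frac{1}{4} \sum_{j=1}^{q-1} \sum_{l=j+1}^{q} \hat g_{jl}^{-1} \left( \beta_l^{0\prime}\hat\beta_j - \beta_j^{0\prime}\hat\beta_l \right)^2
 + \sum_{j=1}^{q} \sum_{l=q+1}^{p} \hat g_{jl}^{-1} \left( \hat\beta_l'\beta_j^0 \right)^2 \right] \tag{3.3}$$

is asymptotically distributed as chi square with $q(p - (q + 1)/2)$ degrees of
freedom.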


It is worth noting that (3.3) has a geometrical interpretation. The first sum
ranges over all pairs of eigenvectors fixed under $H_q$. If $H_q$ holds, we would expect
$\hat\beta_j$ and $\beta_l^0$ to be nearly orthogonal for $l \ne j$, and the cosines $\hat\beta_j'\beta_l^0$ would be
expected close to zero. Similarly, the second sum extends over the squared cosines
between the hypothetical vectors $\beta_j^0$ and the estimated vectors $\hat\beta_l$ for the $p - q$
eigenvectors not considered under $H_q$. The $\hat g_{jl}^{-1}$ serve as weights, which seems
intuitively reasonable, regarding the fact that $\hat g_{jl}^{-1}$ is large if (at least in one
group) $\hat\lambda_{ij}$ and $\hat\lambda_{il}$ are far apart. It may also be noted that Theorem 4 does not
require that the eigenvectors $\beta_{q+1}, \ldots, \beta_p$ are well defined.
Two special cases deserve more attention.


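CASE q = 1. If only one hypothetical eigenvector $\beta_1^0$ is specified, the first
sum in (3.3) is empty, and the test statistic becomes

$$\begin{aligned}
X^2(H_1) &= n \sum_{l=2}^{p} \hat g_{1l}^{-1} \left( \hat\beta_l'\beta_1^0 \right)^2 \\
&= n \sum_{i=1}^{k} r_i \sum_{l=2}^{p} \frac{(\hat\lambda_{i1} - \hat\lambda_{il})^2}{\hat\lambda_{i1}\hat\lambda_{il}} \left( \hat\beta_l'\beta_1^0 \right)^2 \\
&= n \sum_{i=1}^{k} r_i \left[ \hat\lambda_{i1} \sum_{l=1}^{p} \frac{(\hat\beta_l'\beta_1^0)^2}{\hat\lambda_{il}}
 + \hat\lambda_{i1}^{-1} \sum_{l=1}^{p} \hat\lambda_{il} (\hat\beta_l'\beta_1^0)^2 - 2 \right] \\
&= n \sum_{i=1}^{k} r_i \left[ \hat\lambda_{i1}\, \beta_1^{0\prime}\hat\Sigma_i^{-1}\beta_1^0
 + \hat\lambda_{i1}^{-1}\, \beta_1^{0\prime}\hat\Sigma_i\beta_1^0 - 2 \right],
\end{aligned} \tag{3.4}$$

where $\hat\Sigma_i = \sum_{j=1}^{p} \hat\lambda_{ij}\hat\beta_j\hat\beta_j'$ is the ML estimate of $\Sigma_i$. The number of degrees of
freedom associated with (3.4) is $p - 1$. If $k = 1$, then (3.4) reduces to the
well-known result given by Anderson (1963, page 145). If we replace $\hat\Sigma_i$ by $S_i$ and
$\hat\lambda_{i1}$ by $l_{i1}$ (the first eigenvalue of $S_i$), we get a test for the hypothesis $H_1^*$ that $\beta_1^0$
is the first principal component of $\Sigma_1, \ldots, \Sigma_k$ without specifying the CPC model.
The test statistic

$$X^2(H_1^*) = \sum_{i=1}^{k} n_i \left( l_{i1}\, \beta_1^{0\prime}S_i^{-1}\beta_1^0 + l_{i1}^{-1}\, \beta_1^{0\prime}S_i\beta_1^0 - 2 \right) \tag{3.5}$$

is merely the sum of $k$ independent statistics of the form given by Anderson, and
its asymptotic null distribution is chi square with $k(p - 1)$ degrees of freedom.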
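The equality between the first and last lines of (3.4) is an algebraic identity that holds for any unit vector $\beta_1^0$ and any orthogonal $\hat B$. A minimal sketch, assuming $n r_i = n_i$ and using made-up eigenvalues, sample sizes, and function names, evaluates both forms and compares them:

```python
import numpy as np

def x2_h1_sum(beta0, B_hat, lam_hat, n_i):
    """Sum form: sum_i n_i sum_{l>=2} (lam_i1 - lam_il)^2 / (lam_i1 lam_il) * (b_l' beta0)^2."""
    cos2 = (B_hat[:, 1:].T @ beta0) ** 2              # squared cosines for l = 2..p
    total = 0.0
    for lam, n in zip(lam_hat, n_i):
        lam = np.asarray(lam, dtype=float)
        w = (lam[0] - lam[1:]) ** 2 / (lam[0] * lam[1:])
        total += n * float(np.sum(w * cos2))
    return total

def x2_h1_matrix(beta0, B_hat, lam_hat, n_i):
    """Matrix form: sum_i n_i [lam_i1 b0' S^-1 b0 + b0' S b0 / lam_i1 - 2], S = B diag(lam_i) B'."""
    total = 0.0
    for lam, n in zip(lam_hat, n_i):
        lam = np.asarray(lam, dtype=float)
        sigma = (B_hat * lam) @ B_hat.T                # B diag(lam) B'
        quad_inv = beta0 @ np.linalg.solve(sigma, beta0)
        quad = beta0 @ sigma @ beta0
        total += n * (lam[0] * quad_inv + quad / lam[0] - 2.0)
    return total

# Hypothetical inputs (not the turtle data): p = 3, k = 2.
rng = np.random.default_rng(2)
B_hat, _ = np.linalg.qr(rng.standard_normal((3, 3)))
lam_hat = [[4.0, 1.0, 0.5], [8.0, 2.0, 0.3]]
n_i = [30, 40]
beta0 = np.ones(3) / np.sqrt(3.0)                      # unit-length hypothetical vector

a = x2_h1_sum(beta0, B_hat, lam_hat, n_i)
b = x2_h1_matrix(beta0, B_hat, lam_hat, n_i)
print(np.isclose(a, b))
```

When $\beta_1^0$ coincides with $\hat\beta_1$, all cosines in the sum form vanish and both expressions are zero.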


CASE q = p. If all common eigenvectors $\beta_1^0, \ldots, \beta_p^0$ of the $\Sigma_i$ are specified,
the second sum in (3.3) is empty. Since $\beta_p^0$ is completely determined by
$\beta_1^0, \ldots, \beta_{p-1}^0$, the hypotheses $H_{p-1}$ and $H_p$ are equivalent. The two associated
statistics $X^2(H_{p-1})$ and $X^2(H_p)$ are in general not identical (unless $p = 2$), but
the degrees of freedom are $p(p - 1)/2$ for both statistics.


4. Asymptotic inference for eigenvalues. 


4.1. A criterion for neglecting common principal components with relatively
small variances. Anderson (1963, page 133ff.) has shown how asymptotic
confidence intervals for individual roots or sums of roots in one-sample PCA
can be constructed. Since the maximum likelihood estimates $\hat\lambda_{ij}$ in CPCA are
asymptotically independent (cf. Theorem 1), the generalization of Anderson's
results to the k-sample case is straightforward and need not be given here.

If the main purpose of CPCA is data reduction, it is useful to have some 
criterion for discarding CPC’s with relatively small variances. Let 


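$$c_i = \sum_{j=1}^{q} \lambda_{ij}, \qquad \hat c_i = \sum_{j=1}^{q} \hat\lambda_{ij}, \tag{4.1}$$

and put $d_i = \operatorname{tr}\Sigma_i - c_i$, $\hat d_i = \operatorname{tr}\hat\Sigma_i - \hat c_i$. Suppose that we wish to discard the last
$p - q$ CPC's in population $i$ if their relative contribution to the trace of $\Sigma_i$ is not
larger than a given fraction $f_0$ ($0 < f_0 < 1$). Putting

$$f_i = d_i / \operatorname{tr}\Sigma_i, \tag{4.2}$$

the asymptotic distribution of $\sqrt{n_i}\,[(1 - f_i)\hat d_i - f_i \hat c_i]$ is normal with mean zero
and variance $2\left[ f_i^2 \sum_{j=1}^{q} \lambda_{ij}^2 + (1 - f_i)^2 \sum_{j=q+1}^{p} \lambda_{ij}^2 \right]$. (The use of this criterion has
been proposed by Anderson (1963, page 135).) Estimating this variance
consistently by putting in the corresponding ML estimates $\hat\lambda_{ij}$ yields

$$z_i = \frac{\sqrt{n_i}\,\left[ (1 - f_0)\hat d_i - f_0 \hat c_i \right]}
{\left\{ 2\left[ f_0^2 \sum_{j=1}^{q} \hat\lambda_{ij}^2 + (1 - f_0)^2 \sum_{j=q+1}^{p} \hat\lambda_{ij}^2 \right] \right\}^{1/2}} \sim N(0, 1) \tag{4.3}$$

approximately for large $n_i$, and under the hypothesis $f_i = f_0$. For testing the
hypothesis that all $f_i$ ($i = 1, \ldots, k$) are less than or equal to $f_0$, a possible
procedure is to reject the hypothesis if

$$\max_{1 \le i \le k} z_i > z_\beta \quad \text{with } \beta = 1 - (1 - \alpha)^{1/k}, \tag{4.4}$$

where $z_\beta$ is the upper $\beta$ quantile of the standard normal distribution. This test
has asymptotic level $\alpha$ if all $f_i$ equal $f_0$.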
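The criterion (4.3) and the rejection rule (4.4) translate directly into a few lines of code. The sketch below is only an illustration: the function names are invented, and the fraction $f_0 = 0.05$ and level $\alpha = 0.05$ in the usage lines are arbitrary choices, applied to the variances listed in Table 1(b).

```python
from math import sqrt
from statistics import NormalDist

def z_stat(lam_hat, q, f0, n_i):
    """z_i of (4.3) for one group; lam_hat lists the p estimated CPC variances."""
    c_hat = sum(lam_hat[:q])                      # variance kept by the first q CPC's
    d_hat = sum(lam_hat[q:])                      # variance of the discarded CPC's
    num = sqrt(n_i) * ((1.0 - f0) * d_hat - f0 * c_hat)
    var = 2.0 * (f0 ** 2 * sum(x * x for x in lam_hat[:q])
                 + (1.0 - f0) ** 2 * sum(x * x for x in lam_hat[q:]))
    return num / sqrt(var)

def critical_value(alpha, k):
    """z_beta of (4.4): upper beta quantile of N(0, 1) with beta = 1 - (1 - alpha)^(1/k)."""
    beta = 1.0 - (1.0 - alpha) ** (1.0 / k)
    return NormalDist().inv_cdf(1.0 - beta)

def reject(lam_hats, q, f0, n, alpha):
    """Reject 'all f_i <= f0' when the largest z_i exceeds z_beta."""
    z = [z_stat(lam, q, f0, n_i) for lam, n_i in zip(lam_hats, n)]
    return max(z) > critical_value(alpha, len(lam_hats))

# Usage with the Table 1(b) variances; f0 and alpha are hypothetical choices.
turtle = [[2.3148, 0.0729, 0.0385], [6.7135, 0.0807, 0.0538]]
decision = reject(turtle, q=1, f0=0.05, n=[23, 23], alpha=0.05)
print(decision)
```

With $k = 1$ the rule reduces to an ordinary one-sided test at level $\alpha$; for larger $k$ the critical value grows because $\beta < \alpha$.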


4.2. A likelihood ratio test for sphericity of p − q common principal
components. In PCA, the main motivation for testing for equality of $p - q$ (out of $p$)
characteristic roots stems from the model $\Sigma = \psi + \sigma^2 I_p$, where $\psi$ is positive
semidefinite of rank $q$. In this model the last $p - q$ characteristic roots are all
$\sigma^2$. In CPCA, we can study the model $\Sigma_i = \psi_i + \sigma_i^2 I_p$ ($i = 1, \ldots, k$), where the $\psi_i$
are simultaneously diagonalizable and of rank $q$. Then the $\Sigma_i$ satisfy $H_C$ and the
last $p - q$ CPC's are spherical, i.e.,

$$H_S: \lambda_{i,q+1} = \cdots = \lambda_{ip} \qquad (i = 1, \ldots, k). \tag{4.5}$$

We will refer to $H_S$ as "hypothesis of partial sphericity."

It may be noted that the following derivation of the likelihood ratio test holds
as well for any subset of CPC's, but for notational simplicity it is given in terms
of the hypothesis $H_S$ as defined in (4.5).


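Putting $\lambda_{i,q+1} = \cdots = \lambda_{ip} = \lambda_i^*$ ($i = 1, \ldots, k$), we get from (2.1)

$$g(\Sigma_1, \ldots, \Sigma_k) = \sum_{i=1}^{k} n_i \left[ \sum_{j=1}^{q} \left( \log\lambda_{ij} + \beta_j'S_i\beta_j/\lambda_{ij} \right)
 + (p - q)\log\lambda_i^* + \frac{1}{\lambda_i^*} \sum_{j=q+1}^{p} \beta_j'S_i\beta_j \right]. \tag{4.6}$$

Using the same technique as Flury (1984) the likelihood equations are obtained as

$$\begin{aligned}
\beta_l' \left[ \sum_{i=1}^{k} n_i \frac{\lambda_{il} - \lambda_{ij}}{\lambda_{il}\lambda_{ij}} S_i \right] \beta_j &= 0 \qquad (1 \le l < j \le q), \\
\beta_l' \left[ \sum_{i=1}^{k} n_i \frac{\lambda_{il} - \lambda_i^*}{\lambda_{il}\lambda_i^*} S_i \right] \beta_j &= 0 \qquad (1 \le l \le q < j \le p), \\
\lambda_{ij} &= \beta_j'S_i\beta_j \qquad (i = 1, \ldots, k;\ j = 1, \ldots, q), \\
\lambda_i^* &= \frac{1}{p - q} \sum_{j=q+1}^{p} \beta_j'S_i\beta_j \qquad (i = 1, \ldots, k),
\end{aligned} \tag{4.7}$$

with the orthogonality restrictions $\beta_l'\beta_j = 0$ ($l \ne j$), $\beta_j'\beta_j = 1$. The equation
system (4.7) can be solved using an appropriate modification of the FG algorithm
(Flury and Gautschi, 1986).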

In contrast to the unrestricted CPC model the vectors $\beta_{q+1}, \ldots, \beta_p$ are not
uniquely determined by the likelihood equations. In fact, only the subspace
spanned by $\beta_{q+1}, \ldots, \beta_p$ is determined. Let us denote by $\tilde\beta_1, \ldots, \tilde\beta_q, \tilde\beta_{q+1}, \ldots, \tilde\beta_p$ a
set of orthonormal vectors solving (4.7); then the same maximum of the
likelihood is obtained if we replace $(\tilde\beta_{q+1}, \ldots, \tilde\beta_p)$ by $(\tilde\beta_{q+1}, \ldots, \tilde\beta_p)H$, where $H$ is an
arbitrary orthogonal matrix of dimension $(p - q) \times (p - q)$. With $\tilde\lambda_{ij}$ and $\tilde\lambda_i^*$
denoting the ML estimates of the eigenvalues, the log-likelihood ratio statistic for
$H_S$ can be written as

$$X_S^2 = \sum_{i=1}^{k} n_i \log \frac{(\tilde\lambda_i^*)^{p-q} \prod_{j=1}^{q} \tilde\lambda_{ij}}{\prod_{j=1}^{p} \hat\lambda_{ij}}, \tag{4.8}$$

where the $\hat\lambda_{ij}$ are the ML estimates for the ordinary (unrestricted) CPC
model. Under $H_S$ the number of parameters determining $\Sigma_1, \ldots, \Sigma_k$ is
$q(2p - q - 1)/2 + k(q + 1)$, compared with $p(p - 1)/2 + kp$ parameters for
the ordinary CPC model (see, e.g., Mardia, Kent, and Bibby (1979, page 235ff.)
for a discussion of this problem in the one-group situation). Thus the null
distribution of (4.8) is asymptotically chi square with $(p - q - 1)(p - q + 2k)/2$
degrees of freedom.
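The degrees of freedom follow by subtracting the two parameter counts just quoted; the intermediate steps, written out here only as a check of the arithmetic, are:

```latex
\begin{aligned}
\Bigl[\tfrac{p(p-1)}{2} + kp\Bigr] - \Bigl[\tfrac{q(2p-q-1)}{2} + k(q+1)\Bigr]
  &= \tfrac{1}{2}\bigl[p^2 - p - 2pq + q^2 + q\bigr] + k(p-q-1) \\
  &= \tfrac{1}{2}(p-q)(p-q-1) + k(p-q-1) \\
  &= \tfrac{1}{2}(p-q-1)(p-q+2k).
\end{aligned}
```

For instance, with $p = 3$, $q = 1$ and $k = 2$ the count is $(1)(6)/2 = 3$.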

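It may be noted that, unless $k = 1$, the ML estimates $\tilde\beta_j$ and $\tilde\lambda_{ij}$ for $j \le q$ are
not identical with $\hat\beta_j$ and $\hat\lambda_{ij}$. However, we can approximate $X_S^2$ without
computing the restricted solution by replacing $\tilde\lambda_{ij}$ by $\hat\lambda_{ij}$ ($j = 1, \ldots, q$) and $\tilde\lambda_i^*$
by $\hat\lambda_i^* = (\hat\lambda_{i,q+1} + \cdots + \hat\lambda_{ip})/(p - q)$. This yields

$$X_S^2(\text{approx}) = \sum_{i=1}^{k} n_i \log \frac{(\hat\lambda_i^*)^{p-q}}{\prod_{j=q+1}^{p} \hat\lambda_{ij}}. \tag{4.9}$$

Since, under $H_S$, the likelihood is maximized for the $\tilde\lambda$'s, we have always
$X_S^2(\text{approx}) \ge X_S^2$. Thus the approximate statistic can be used to accept $H_S$, but
not necessarily to reject it.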

It should be noted that $H_S$ does not necessarily imply a model of the form
$\Sigma_i = \psi_i + \sigma_i^2 I_p$, and partial sphericity may also occur for those CPC's associated
with large roots. In practical applications of CPCA it is important to make sure
that those components that are to be interpreted are not spherical, since a
coefficient should be interpreted only if it is well-defined.


5. Applications. In this section some of the preceding theory is illustrated 
by a numerical example. The data used have been published by Jolicoeur and 
Mosimann (1960) and have served as an example of PCA in various textbooks 
(e.g., Morrison, 1976; Mardia, Kent, and Bibby, 1979). The main appeal of this
example is its simplicity—the data are only three-dimensional, yet illustrate the 
purpose of CPCA clearly, which outweighs the disadvantage of rather small 
sample sizes. 

Table 1(a) displays covariance matrices $S_i$ of samples of $n_1 + 1 = 24$ male and
$n_2 + 1 = 24$ female individuals of the species Chrysemys picta marginata (painted
turtle). The variables are (1) log(carapace length); (2) log(carapace width); (3)
log(carapace height). The logarithms are used instead of the measured variables
because of their relationship to allometry; see Morrison (1976, page 295). Table
1(b) shows the eigenvalues of the $S_i$ and the ML estimates $\hat\lambda_{ij}$. The value of the
chi square statistic for $H_C$ (Flury, 1984) is $X^2 = 7.93$ with three degrees of
freedom, which is close to the 95% quantile of the asymptotic null distribution of
the criterion. Regarding the relatively small sample sizes it may be reasonable to
assume that $H_C$ holds.

Table 1(c) shows the estimated eigenvectors $\hat\beta_j$ and the estimated asymptotic
standard errors of their coefficients. The standard errors were obtained from the
main diagonal of the sample analog of (2.13). It is obvious that $\hat\beta_1$ has stable
coefficients, while $\hat\beta_2$ and $\hat\beta_3$ seem rather poorly defined.

The hypothesis of allometric growth of an organism (Jolicoeur, 1963) implies
that the first principal component of the covariance matrix of the logarithms of
the measured dimensions is $\beta_1 = (1, \ldots, 1)'/\sqrt{p}$. Let us therefore test the
hypothesis $H_1$ (3.1) for $\beta_1^0 = (1, 1, 1)'/\sqrt{3}$. The statistic $X^2(H_1)$ is obtained from
(3.3) or (3.4) as $X^2(H_1) = 46.17$ with two degrees of freedom. At any reasonable
level $\alpha$ we would therefore conclude that the allometric model does not hold in
this case.

TABLE 1
Common principal component analysis of turtle carapace dimensions,
transformed logarithmically.

(a) Sample covariance matrices (a)

    males ($n_1 = 23$)                      females ($n_2 = 23$)
            1.1072  0.8019  0.8160                  2.6391  2.0124  2.5443
    S_1 =   0.8019  0.6417  0.6005          S_2 =   2.0124  1.6190  1.9782
            0.8160  0.6005  0.6773                  2.5443  1.9782  2.5899

(b) Variances of CPC's and eigenvalues of $S_i$

    males     $\hat\lambda_{1j}$    2.3148  0.0729  0.0385
              eigenvalues           2.3303  0.0599  0.0360
    females   $\hat\lambda_{2j}$    6.7135  0.0807  0.0538
              eigenvalues           6.7200  0.0751  0.0530

(c) Coefficients of CPC's (b)

    $\hat\beta_1$              $\hat\beta_2$              $\hat\beta_3$
     0.6406 (0.013)            -0.3839 (0.182)            -0.6650 (0.105)
     0.4905 (0.015)            -0.4617 (0.201)             0.7391 (0.126)
     0.5907 (0.016)             0.7997 (0.032)             0.1075 (0.218)

(a) Multiplied by $10^2$.
(b) Standard errors, given in parentheses, are based on large sample theory.


Finally, let us see whether the second and third CPC's are well defined, i.e., let
us test the hypothesis of simultaneous sphericity of the second and third CPC's.
The null hypothesis is $H_S$: $\lambda_{i2} = \lambda_{i3}$ ($i = 1, 2$). Without computing the ML
estimates under $H_S$, we can easily calculate the approximation (4.9) from the
values displayed in Table 1(b). The resulting statistic is $X_S^2(\text{approx}) = 3.24$. Since
this is far below the 95% quantile of the chi square distribution with three
degrees of freedom, we conclude that $H_S$ is reasonable. Taking into consideration
the relative smallness of the second and third roots in both groups, we can thus
think of the three shell dimensions as distributed about a single principal axis
("size") and two minor axes containing merely measurement errors, the main axis
having the same orientation in space for both male and female turtles.
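The approximate statistic (4.9) needs only the estimated variances in Table 1(b), so the value quoted above is easy to recompute. The following sketch uses hypothetical function names; the inputs are the tabulated $\hat\lambda_{ij}$ with $n_1 = n_2 = 23$ and $q = 1$.

```python
from math import log, prod

def x2_s_approx(lam_hat, n, q):
    """Approximate sphericity statistic (4.9) for the last p - q CPC's."""
    total = 0.0
    for lam, n_i in zip(lam_hat, n):
        tail = lam[q:]                              # variances of the last p - q CPC's
        lam_star = sum(tail) / len(tail)            # their arithmetic mean
        total += n_i * log(lam_star ** len(tail) / prod(tail))
    return total

def sphericity_df(p, q, k):
    """Degrees of freedom (p - q - 1)(p - q + 2k)/2 of the likelihood ratio test."""
    return (p - q - 1) * (p - q + 2 * k) // 2

# Values from Table 1(b): males first, females second.
lam_hat = [[2.3148, 0.0729, 0.0385], [6.7135, 0.0807, 0.0538]]
stat = x2_s_approx(lam_hat, n=[23, 23], q=1)
df = sphericity_df(p=3, q=1, k=2)
print(round(stat, 2), df)   # 3.24 3
```

Each term of (4.9) compares an arithmetic mean with a geometric mean, so the statistic is nonnegative and vanishes only when the last $p - q$ estimated variances coincide in every group.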


6. Conclusions. In this paper we have shown how asymptotic theory can be 
used for inference on CPC models. The methods given in Sections 3 and 4 merely 
reflect the author’s opinion about which hypotheses might be important in 
practice. Other hypotheses and restrictions of the model can easily be
formulated; we might for instance be interested in a model where some of the
eigenvalues of two matrices $\Sigma_i$ and $\Sigma_j$ are identical. Tests for such hypotheses
could be constructed either by the likelihood ratio method or using the asymp- 
totic results of Section 2. 

One open problem deserves to be investigated: Suppose that we are interested
only in the first $q$ (out of $p$) CPC's and wish to neglect the last $p - q$
components. Then we would actually not care whether the $\Sigma_i$ have all
eigenvectors in common; it would be sufficient to know that $\beta_1, \ldots, \beta_q$ are common to
$\Sigma_1, \ldots, \Sigma_k$. This could be called a partial CPC model.

Obviously a partial CPC model may hold even when the ordinary CPC model
is wrong, and the test for $H_C$ may in some situations reject the hypothesis
although the first $q$ eigenvectors are common to all matrices $\Sigma_1, \ldots, \Sigma_k$. This
problem is currently under investigation.


Acknowledgments. I wish to thank Professors Ingram Olkin and T. W. 
Anderson for some helpful discussions. The comments of an associate editor and 
two referees helped considerably to improve the presentation of this article. 


REFERENCES 
ANDERSON, T. W. (1963). Asymptotic theory for principal component analysis. Ann. Math. Statist.
34 122-148.

FLURY, B. N. (1984). Common principal components in k groups. J. Amer. Statist. Assoc. 79
892-898.

FLURY, B. N. and GAUTSCHI, W. (1986). An algorithm for simultaneous orthogonal transformation
of several positive definite matrices to nearly diagonal form. SIAM J. Sci. Statist.
Comput. 6.

GIRSHICK, M. A. (1939). On the sampling theory of roots of determinantal equations. Ann. Math.
Statist. 10 203-224.

JOLICOEUR, P. (1963). The multivariate generalization of the allometry equation. Biometrics 19
497-499.

JOLICOEUR, P. and MOSIMANN, J. E. (1960). Size and shape variation in the painted turtle: A
principal component analysis. Growth 24 339-354.

LAWLEY, D. N. (1953). A modified method of estimation in factor analysis and some large sample
results. Uppsala Symposium on Psychological Factor Analysis, 35-42. Almqvist and
Wicksell, Uppsala.

LAWLEY, D. N. (1956). Tests of significance for the latent roots of covariance and correlation
matrices. Biometrika 43 128-136.

MARDIA, K. V., KENT, J. T. and BIBBY, J. M. (1979). Multivariate Analysis. Academic, New York.

MORRISON, D. F. (1976). Multivariate Statistical Methods. McGraw-Hill, New York.

MUIRHEAD, R. J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York.

SILVEY, S. D. (1975). Statistical Inference. Chapman and Hall, London.

WILKS, S. S. (1944). Mathematical Statistics. Princeton Univ. Press.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF BERNE 
SIDLERSTRASSE 5 

CH 3012 BERNE 
SWITZERLAND 


The Annals of Statistics
1986, Vol. 14, No. 2, 431-449


CONFIDENCE SETS FOR A MULTIVARIATE DISTRIBUTION^1


By R. BERAN AND P. W. MILLAR 
University of California, Berkeley 


The confidence sets for a q-dimensional distribution studied in this
paper have several attractive features: affine invariance, correct asymptotic
level whatever the actual distribution may be, numerical feasibility, and a
local asymptotic minimax optimality property. When dimension q equals
one, the confidence sets reduce to the usual Kolmogorov-Smirnov confidence
bands, except that critical values are determined by bootstrapping.


1. Introduction. Kolmogorov-Smirnov confidence bands for a one-dimensional
cdf have two attractive features: affine invariance and distribution-free
critical values over the class of all continuous cdf's. Neither property is retained
by the analogous confidence sets for a multivariate distribution based on the
q-dimensional Kolmogorov-Smirnov statistic ($q \ge 2$). Studied in this paper is an
alternative multivariate version of the one-dimensional Kolmogorov-Smirnov
confidence band which preserves affine invariance, has correct asymptotic level,
and makes equally good sense whether the actual distribution of the data is
discrete, possesses a Lebesgue density, or is singular with respect to Lebesgue
measure.


1.1. Half-spaces and confidence sets. Let $|\cdot|$ and $\langle \cdot, \cdot \rangle$ denote,
respectively, Euclidean norm and inner product in $R^q$. Let $S_q = \{s \in R^q: |s| = 1\}$ be
the unit sphere in $R^q$. For every $(s, t) \in S_q \times R$, let $A(s, t)$ be the half-space

$$A(s, t) = \{x \in R^q: \langle s, x \rangle \le t\}. \tag{1.1}$$

Let $\mathcal{P}$ be the set of all probability measures defined on the Borel sets of $R^q$. The
half-spaces $\mathcal{A} = \{A(s, t): (s, t) \in S_q \times R\}$ separate probabilities in the sense
that, if $P, Q \in \mathcal{P}$ and $P(A) = Q(A)$ for every $A \in \mathcal{A}$, then $P, Q$ agree on all
Borel sets (Cramér and Wold, 1936). The class $\mathcal{A}$ is a Vapnik-Červonenkis class
of index $q + 1$ (e.g., Dudley, 1978) and is invariant under affine transformation
of $R^q$.

Consider the distance between $P, Q \in \mathcal{P}$ defined by

$$d(P, Q) = \sup\{|P(A(s, t)) - Q(A(s, t))|: (s, t) \in S_q \times R\}. \tag{1.2}$$

Introduced into statistics by Wolfowitz (1954), the half-space distance $d$ has
reemerged in recent discussions of projection pursuit (Diaconis and Freedman,
1984; Huber, 1985). Let $\hat P_n$ be the empirical distribution of $x_1, x_2, \ldots, x_n$, iid
$R^q$-valued random vectors with unknown distribution $P \in \mathcal{P}$. The proposed
confidence set for $P$ has the form $\{Q \in \mathcal{P}: d(\hat P_n, Q) \le c\}$.


Received June 1984; revised June 1985.

1 Research supported by National Science Foundation Grant MCS84-03239.

AMS 1980 subject classifications. 62G05, 62H12.

Key words and phrases. Confidence set, multivariate distribution, affine invariance, local
asymptotic minimax, bootstrap.

How is the critical value $c$ to be chosen so that the confidence set has level
$1 - \alpha$? Bounds for $P^n[d(\hat P_n, P) > c]$ obtained by Vapnik and Červonenkis
(1971), Devroye (1982), Alexander (1984) appear far too conservative to yield
accurate values of $c$. Numerical evidence for this assertion is presented in Section
3; see also Huber (1985). Direct asymptotic approximations appear no more
useful, though for different reasons. Let $L_\infty$ be the set of all bounded measurable
functions on $S_q \times R$, metrized by the supremum norm $\|\cdot\|$. The $\sigma$-algebra in $L_\infty$
is that generated by open balls. Let $W = \{W(s, t): (s, t) \in S_q \times R\}$ be a
Gaussian process with mean zero, covariance function

$$E[W(s, t)W(s', t')] = P[A(s, t) \cap A(s', t')] - P[A(s, t)]\,P[A(s', t')],$$

and sample paths in $L_\infty$. Then

$$\mathcal{L}[n^{1/2} d(\hat P_n, P) \mid P^n] \Rightarrow \mathcal{L}(\|W\|) \tag{1.3}$$

(cf. Dudley, 1978). In general, the cdf of this limit law depends upon the
unknown distribution $P$ and is not tractable.


1.2. Bootstrap confidence sets. A bootstrap construction (cf. Efron, 1979) for
the critical value $c$ avoids some of the difficulties. Let $t_n(\alpha, P)$ denote an upper
$\alpha$-point of $\mathcal{L}[n^{1/2} d(\hat P_n, P) \mid P^n]$. Define the confidence set

$$C_n(\alpha, \hat P_n) = \{Q \in \mathcal{P}: n^{1/2} d(\hat P_n, Q) \le t_n(\alpha, \hat P_n)\}. \tag{1.4}$$

(For a more precise description of $t_n(\alpha, P)$, see Beran, 1984.) A triangular array
version of weak convergence (1.3), derivable from a result in Le Cam (1983),
implies that $\lim_{n \to \infty} P^n[C_n(\alpha, \hat P_n) \ni P] = 1 - \alpha$; that is, the asymptotic level of
confidence set $C_n(\alpha, \hat P_n)$ is $1 - \alpha$. Theorem 2 in Section 2 gives a stronger version
of this result.

Definition (1.2) leads to an exact algorithm for computing d( P, Q) when both 
P and Q are supported on a finite set of cardinality m; the number of mathemati- 
cal operations required is of order 29(™). When dimension q is small, this 
algorithm can be used to compute Monte Carlo approximations to $t_n(\alpha, \hat P_n)$ and
to determine whether a given distribution $Q$ lies in confidence set $C_n(\alpha, \hat P_n)$. In
the latter application, $Q$ is first replaced by a discrete approximation.

When dimension $q$ is larger, confidence sets for $P$ based upon a stochastic
approximation to $d$ become attractive for computational reasons. Let
$s_1, s_2, \ldots, s_{k_n}$ be iid random unit vectors, uniformly distributed on $S_q$ and
independent of the $\{x_i; 1 \le i \le n\}$. For $P, Q \in \mathcal{P}$, define

$$d_n(P, Q) = \max_{1 \le k \le k_n} \sup_{t \in R} |P(A(s_k, t)) - Q(A(s_k, t))|. \tag{1.5}$$

Random selection of the $\{s_k; 1 \le k \le k_n\}$ has some advantages over systematic
selection. The condition $\lim_{n \to \infty} k_n = \infty$ ensures that $\lim_{n \to \infty} d_n(P, Q) =
d(P, Q)$ w.p. 1, the rate of convergence in probability of $d_n(P, Q)$ to $d(P, Q)$
being exponential in $k_n$. More precisely, let $\mu$ denote the uniform distribution on
$S_q$ and let $Y(s) = \sup_t |P[A(s, t)] - Q[A(s, t)]|$. Then, for every positive $\varepsilon$,

$$\mu[d_n(P, Q) > (1 - \varepsilon)d(P, Q)] = 1 - [1 - b(\varepsilon)]^{k_n}, \tag{1.6}$$

where $b(\varepsilon) = \mu[Y(s_1) > (1 - \varepsilon)d(P, Q)]$.
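For two discrete distributions given by samples, the inner supremum in (1.5) is the one-dimensional Kolmogorov-Smirnov distance between the projections onto $s_k$, so $d_n$ can be sketched in a few lines. The code below is an illustration under stated assumptions: both $P$ and $Q$ are represented by equally weighted sample points, the directions are drawn by normalizing Gaussian vectors, and all function names are invented here.

```python
import numpy as np

def ks_distance(u, v):
    """sup_t |F_u(t) - F_v(t)| for the empirical cdfs of two one-dimensional samples."""
    grid = np.sort(np.concatenate([u, v]))           # the supremum is attained at a data point
    fu = np.searchsorted(np.sort(u), grid, side="right") / len(u)
    fv = np.searchsorted(np.sort(v), grid, side="right") / len(v)
    return float(np.max(np.abs(fu - fv)))

def random_directions(k, q, rng):
    """k unit vectors uniformly distributed on the sphere S_q (normalized Gaussians)."""
    s = rng.standard_normal((k, q))
    return s / np.linalg.norm(s, axis=1, keepdims=True)

def d_n(x, y, directions):
    """Stochastic approximation (1.5): maximum over directions of the projected KS distance.

    x, y: arrays of shape (n, q) and (m, q) whose rows are equally weighted support points.
    """
    return max(ks_distance(x @ s, y @ s) for s in directions)

# Hypothetical example in dimension q = 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3))
y = rng.standard_normal((40, 3)) + np.array([0.5, 0.0, 0.0])
dirs = random_directions(20, 3, rng)
value = d_n(x, y, dirs)
print(0.0 <= value <= 1.0)
```

Because (1.5) is a maximum over the sampled directions, adding directions can only increase the value, which approaches $d(P, Q)$ from below as $k_n$ grows.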

Let $\mathbf{s}_n = (s_1, s_2, \ldots, s_{k_n})$ and let $u_n(\alpha, P, \mathbf{s}_n)$ denote an upper $\alpha$-point of the
conditional distribution of $n^{1/2} d_n(\hat P_n, P)$ under $P^n$, given $\mathbf{s}_n$. Consider the
modified bootstrap confidence set

$$\check{C}_n(\alpha, \hat P_n) = \{Q \in \mathcal{P}: n^{1/2} d_n(\hat P_n, Q) \le u_n(\alpha, \hat P_n, \mathbf{s}_n)\}. \tag{1.7}$$

The distribution of $\check{C}_n(\alpha, \hat P_n)$ depends upon the joint distribution of the
observations $\{x_i; 1 \le i \le n\}$ and of the search sample $\mathbf{s}_n$. If $\lim_{n \to \infty} k_n = \infty$, the
asymptotic level of $\check{C}_n(\alpha, \hat P_n)$ is $1 - \alpha$ (Theorem 3 in Section 2). The number of
mathematical operations required to compute $d_n(P, Q)$ depends linearly upon
the product $k_n q$.
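A Monte Carlo approximation to the critical value in (1.7) can be sketched as follows: hold the search sample of directions fixed, resample from the empirical distribution, and take the upper $\alpha$ quantile of the resampled roots. This is a minimal sketch rather than the authors' implementation; the number of bootstrap replicates, the representation of a candidate $Q$ by sample points, and all function names are assumptions.

```python
import numpy as np

def ks_distance(u, v):
    """One-dimensional Kolmogorov-Smirnov distance between two samples."""
    grid = np.sort(np.concatenate([u, v]))
    fu = np.searchsorted(np.sort(u), grid, side="right") / len(u)
    fv = np.searchsorted(np.sort(v), grid, side="right") / len(v)
    return float(np.max(np.abs(fu - fv)))

def d_n(x, y, directions):
    """Maximum over the given directions of the projected KS distance, as in (1.5)."""
    return max(ks_distance(x @ s, y @ s) for s in directions)

def random_directions(k, q, rng):
    """k unit vectors uniformly distributed on the sphere in R^q."""
    s = rng.standard_normal((k, q))
    return s / np.linalg.norm(s, axis=1, keepdims=True)

def bootstrap_critical_value(x, directions, alpha, n_boot, rng):
    """Monte Carlo estimate of the upper alpha-point of sqrt(n) d_n(P*_n, P_hat_n),
    conditional on the fixed directions."""
    n = len(x)
    roots = np.empty(n_boot)
    for b in range(n_boot):
        xb = x[rng.integers(0, n, size=n)]           # resample n points with replacement
        roots[b] = np.sqrt(n) * d_n(xb, x, directions)
    return float(np.quantile(roots, 1.0 - alpha))

def in_confidence_set(x, y, directions, u):
    """Check sqrt(n) d_n(P_hat_n, Q) <= u for a candidate Q given by the sample points y."""
    return bool(np.sqrt(len(x)) * d_n(x, y, directions) <= u)

# Hypothetical usage in dimension q = 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((30, 2))
dirs = random_directions(10, 2, rng)
u = bootstrap_critical_value(x, dirs, alpha=0.05, n_boot=200, rng=np.random.default_rng(1))
print(in_confidence_set(x, x, dirs, u))
```

Keeping the directions fixed across replicates mirrors the conditioning on the search sample in the definition of $u_n$; each replicate costs one pass over the $k_n$ directions.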


1.3. Confidence sets and risk. In general, the problem of confidence set
construction has a natural decision theoretic formulation, which views it as a
set-valued estimation problem, subject to the level constraint (see Beran and
Millar, 1985). For confidence set $C_n(\alpha, \hat P_n)$, the decision space treated in this
formulation is the collection of all balls $C(z, r) = \{Q \in \mathcal{P}: d(Q, z) \le r\}$ with
center $z \in \mathcal{P}$ and radius $r$. If $Z_n, R_n$ are estimates of center and radius based
upon the $n$ observations, then the loss function for the confidence set $C(Z_n, R_n)$
is taken to be

$$L(Z_n, R_n; P) = n^{1/2} \sup_{Q \in C(Z_n, R_n)} d(Q, P) \tag{1.8}$$

or a monotone function thereof. Evidently, this loss penalizes for excessive size or
miscentering of $C(Z_n, R_n)$. The risk of $C(Z_n, R_n)$ is then

$$\rho_n(Z_n, R_n; P) = \int L(Z_n, R_n; P)\, dP^n. \tag{1.9}$$

Among all confidence sets of the form just described whose asymptotic level is at
least $1 - \alpha$, the confidence set $C_n(\alpha, \hat P_n)$, defined in (1.4), is locally
asymptotically minimax (Theorems 1 and 2 in Section 2). Moreover, the risk (1.9) of
$C_n(\alpha, \hat P_n)$ can itself be estimated from the data.


2. Asymptotic properties of confidence sets. This section formulates and 
proves the three theorems described in the introduction. Notation introduced 
there is retained. 


2.1. Local asymptotic minimax bound for confidence sets. Any probability
$P \in \mathcal{P}$ can be regarded as an element of $L_\infty$ by identifying $P$ with the function
which maps $(s, t) \in S_q \times R$ into $P(A(s, t))$. With this identification, $d(P, Q) =
\|P - Q\|$ for every $P, Q \in \mathcal{P}$, where $\|\cdot\|$ is supremum norm on $S_q \times R$.


Let $\mathcal{C}(n, \alpha)$ denote the collection of all confidence sets $C(Z_n, R_n)$, $Z_n$ and $R_n$
depending on $x_1, x_2, \ldots, x_n$, which satisfy the criterion of asymptotic level
$1 - \alpha$:

$$\liminf_{n \to \infty} P^n[C(Z_n, R_n) \ni P] \ge 1 - \alpha \quad \text{for every } P \in \mathcal{P}. \tag{2.1}$$

Define a norm $|\cdot|_2$ by $|P - Q|_2 = \sup\{|P(A \cap B) - Q(A \cap B)|: A, B \in \mathcal{A}\}$
for $P, Q \in \mathcal{P}$. Let $\mathcal{B}(n, c, P)$ be the set of all probabilities $Q \in \mathcal{P}$ such that
$|Q - P|_2 \le c a_n$, where $0 < c < \infty$ and $a_n$ is a preselected sequence of real
numbers subject to the constraints $a_n \ge n^{-1/2}$, $a_n \downarrow 0$.

Let $Q_0$ be the distribution of the Gaussian process $W$ described in Section 1,
viewed as a random element of $L_\infty$. Let $\rho_n$ be the risk of confidence set
$C(Z_n, R_n)$ as defined in (1.9).


THEOREM 1. Fix $\alpha \in (0, 1)$ and $P \in \mathcal{P}$. Then

$$\lim_{c \to \infty} \liminf_{n \to \infty} \inf_{C(Z_n, R_n) \in \mathcal{C}(n, \alpha)} \sup_{Q \in \mathcal{B}(n, c, P)} \rho_n(Z_n, R_n; Q) \ge \int (\|z\| + r)\, Q_0(dz), \tag{2.2}$$

where the constant $r$ is determined by

$$Q_0[\|z\| \le r] = 1 - \alpha. \tag{2.3}$$


This theorem rests, in part, upon the abstract Hájek-Le Cam lower bound for
minimax risk of a decision procedure. Theorem 1 remains valid for various
restrictions of $\mathcal{P}$: for instance, if $\mathcal{P}$ is replaced by the set of all probabilities
supported on the unit sphere in $R^q$; or if $\mathcal{P}$ is replaced by the set of all
probabilities supported on a finite subset of $R^q$. Section 3 discusses examples of
these two situations.


PROOF. Theorem 1 will be deduced from (4.5) of Beran and Millar (1985),
cited hereafter as BM. We also draw on results in Millar (1983); these are
referenced by notations of the form X.2 (Chapter X, Section 2). Let $f$ be the
density of $P$ with respect to some $\sigma$-finite measure $\nu$. Let $H$ be the Hilbert space
of real functions $h$ on $R^q$ such that $\int hf\,d\nu = 0$, $\int h^2 f\,d\nu < \infty$, support $h \subset$
support $f$. Let $H_0$ be the subset of $H$ consisting of all $h$ such that $f(1 + n^{-1/2}h)$
is a probability density for all sufficiently large $n$. Then $H_0$ is dense in $H$. Define
$\tau$, a map from $H$ to $L_\infty$, by $(\tau h)(A) = \int_A hf\,d\nu$, $A \in \mathcal{A}$. Let $B$ be the closure of
$\tau H$ in the supremum norm $\|\cdot\|$, so $B \subset L_\infty$. Calculations as in V.2 show that
$(\tau, H, B)$ is an abstract Wiener space; its canonical Gaussian measure on $L_\infty$ is,
in fact, $Q_0$.

Let $\{Q_h, h \in H\}$ be the Gaussian shift experiment (V.3) associated with
$(\tau, H, B)$. If $P_{n,h}^n$ is the n-fold product measure of $P_{n,h}(dx) = f(x)[1 +
n^{-1/2}h(x)]\,\nu(dx)$ then $\{P_{n,h}^n, h \in H_0\}$ converges to $\{Q_h, h \in H_0\}$ (see VI.1).
Define $\xi$ of Section 4 of BM by $\xi(P_{n,h}^n) = P_{n,h} \in L_\infty$. If $\xi'$ is the linear operator
on $L_\infty$ given by $\xi'x = x$ then clearly $\xi(P_{n,h}^n) = \xi(P_{n,0}^n) + \xi'(n^{-1/2}\tau h)$, so (4.2) of
BM holds. Since the set of measures over which the supremum in (2.2) above is
taken contains the measures $\{P_{n,h}: |h| \le c\}$, the theorem follows from the
locally asymptotically minimax lower bound, Theorem 4.5, in BM. □


2.2. Confidence set $C_n(\alpha, \hat P_n)$ is LAM. This subsection demonstrates, in
particular, that the bootstrap confidence set $C_n(\alpha, \hat P_n)$ defined in (1.4) has
asymptotic level $1 - \alpha$, uniformly over $|\cdot|_2$ compacts in $\mathcal{P}$. Moreover, $C_n(\alpha, \hat P_n)$ is
locally asymptotically minimax (LAM) in the sense that its maximum risk over
$\mathcal{B}(n, c, P)$ attains the lower bound for minimax risk given in Theorem 1.


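THEOREM 2. Fix $\alpha \in (0, 1)$ and $P \in \mathcal{P}$. If $\lim_{n \to \infty} |P_n - P|_2 = 0$, then

$$\lim_{n \to \infty} P_n^n[C_n(\alpha, \hat P_n) \ni P_n] = 1 - \alpha \tag{2.4}$$

and

$$t_n(\alpha, \hat P_n) \to r \quad \text{in } P_n^n\text{-probability}, \tag{2.5}$$

where $r$ is defined by (2.3).

Moreover, for every positive $c$

$$\lim_{n \to \infty} \sup_{Q \in \mathcal{B}(n, c, P)} \rho_n(\hat P_n, n^{-1/2} t_n(\alpha, \hat P_n); Q) = \int (\|z\| + r)\, Q_0(dz). \tag{2.6}$$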


PROOF. Suppose $\{P_n \in \mathcal{P}; n \ge 1\}$ satisfies the hypothesis of the theorem.
From Proposition 1 of Section 4,

$$\mathcal{L}[n^{1/2} \|\hat P_n - P_n\| \mid P_n^n] \Rightarrow \mathcal{L}(\|W\|). \tag{2.7}$$

Since the limit law has a continuous, strictly monotone cdf (Proposition 2 of
Section 4), it follows that

$$\lim_{n \to \infty} t_n(\alpha, P_n) = r. \tag{2.8}$$

The Vapnik and Červonenkis (1971) inequality implies, in particular, that
$|\hat P_n - P_n|_2 \to 0$ in $P_n^n$-probability. This convergence and (2.8) yield (2.5). In turn,
(2.5) and (2.7) imply (2.4) (cf. Beran, 1984).

By (2.7) and the exponential bound of Alexander (1984), $n^{1/2} E_{P_n} \|\hat P_n - P_n\|$
converges to $E\|W\|$. Since $\{P_n\}$ could be chosen to be an arbitrary sequence in
$\mathcal{B}(n, c, P)$, a straightforward argument yields (2.6), as follows:

$$\begin{aligned}
\lim_{n \to \infty} \sup_{Q \in \mathcal{B}(n, c, P)} \rho_n(\hat P_n, n^{-1/2} t_n(\alpha, \hat P_n); Q)
&= \lim_{n \to \infty} \rho_n(\hat P_n, n^{-1/2} t_n(\alpha, \hat P_n); P_n) \\
&= \lim_{n \to \infty} \{ n^{1/2} E_{P_n} \|\hat P_n - P_n\| + E_{P_n} t_n(\alpha, \hat P_n) \} \\
&= E\|W\| + r. \qquad \Box
\end{aligned} \tag{2.9}$$


2.3. Confidence set $\check{C}_n(\alpha, \hat P_n)$ has asymptotic level $1 - \alpha$. This subsection
establishes that the computationally simpler confidence set $\check{C}_n(\alpha, \hat P_n)$ defined in
(1.7) has asymptotic level $1 - \alpha$ and that its critical value $u_n(\alpha, \hat P_n, \mathbf{s}_n)$ converges
in probability to $r$. Both convergences are uniform over $|\cdot|_2$ compacts in $\mathcal{P}$. Let
$\mu$ denote the uniform distribution on $S_q$ and let $P^n \times \mu^{k_n}$ designate the product
measure generated by $P^n$ and $\mu^{k_n}$.


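THEOREM 3. Fix $\alpha \in (0, 1)$ and $P \in \mathcal{P}$. If $\lim_{n \to \infty} |P_n - P|_2 = 0$ and
$\lim_{n \to \infty} k_n = \infty$, then

$$\lim_{n \to \infty} (P_n^n \times \mu^{k_n})[\check{C}_n(\alpha, \hat P_n) \ni P_n] = 1 - \alpha \tag{2.10}$$

and

$$u_n(\alpha, \hat P_n, \mathbf{s}_n) \to r \quad \text{in } (P_n^n \times \mu^{k_n})\text{-probability}, \tag{2.11}$$

where $r$ is defined by (2.3).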


PROOF. Suppose $\{P_n \in \mathcal{P}; n \ge 1\}$ and $\{k_n; n \ge 1\}$ satisfy the hypothesis of
the theorem. Let $W_n(s, t) = n^{1/2}\{\hat P_n[A(s, t)] - P_n[A(s, t)]\}$. Since $W_n \Rightarrow W$ as
random elements of $L_\infty$ (Proposition 1 of Section 4), there exist versions of $\{W_n\}$,
$W$ such that $\lim_{n \to \infty} \|W_n - W\| = 0$ for every realization (Wichura, 1970).

Fix the realization of $\{W_n\}$ and $W$. Let $Z_n(s) = \sup_t |W_n(s, t)|$ and $Z(s) =
\sup_t |W(s, t)|$. From above, $\lim_{n \to \infty} \sup\{|Z_n(s) - Z(s)|: s \in S_q\} = 0$. From this
and the evident convergence w.p. 1($\mu$) of $\max\{Z(s_k): 1 \le k \le k_n\}$ to $\operatorname{ess\,sup}_s Z(s)$,
it follows that

$$\lim_{n \to \infty} \max_{1 \le k \le k_n} Z_n(s_k) = \operatorname{ess\,sup}_s Z(s) \quad \text{w.p. } 1(\mu). \tag{2.12}$$

Let $\mathcal{L}[\max_{1 \le k \le k_n} \sup_t |W_n(s_k, t)| \mid \mathbf{s}_n, P_n^n]$ denote the conditional distribution of
the first argument, given $\mathbf{s}_n$. In view of (2.12),

$$\mathcal{L}\Bigl[\max_{1 \le k \le k_n} \sup_t |W_n(s_k, t)| \Bigm| \mathbf{s}_n, P_n^n\Bigr]
\Rightarrow \mathcal{L}\Bigl[\operatorname{ess\,sup}_s \sup_t |W(s, t)|\Bigr] \quad \text{w.p. } 1(\mu) \tag{2.13}$$

for the original versions of $\{W_n\}$ and $W$.
Let $\{t_j; j \ge 1\}$ be a fixed countable dense subset of $R$. With probability 1($\mu$),
the random set $\{s_k; k \ge 1\}$ is a dense subset of $S_q$. For every $n$,

$$\begin{aligned}
\operatorname{ess\,sup}_s \sup_t |W_n(s, t)| &= \sup_{k \ge 1} \sup_t |W_n(s_k, t)| \\
&= \sup_{j, k \ge 1} |W_n(s_k, t_j)| \quad \text{w.p. } 1(\mu) \\
&= \|W_n\|.
\end{aligned} \tag{2.14}$$

The final equality holds because $\|W_n\|$ equals the supremum of $|W_n(s, t)|$ over any
countable dense subset of $S_q \times R$. Letting $n$ tend to infinity in (2.14) proves

$$\mathcal{L}\Bigl[\operatorname{ess\,sup}_s \sup_t |W(s, t)|\Bigr] = \mathcal{L}[\|W\|]. \tag{2.15}$$

Combining (2.13) with (2.15) yields the convergences:

$$u_n(\alpha, P_n, \mathbf{s}_n) \to r \quad \text{w.p. } 1(\mu)$$

and

$$\mathcal{L}\Bigl[\max_{1 \le k \le k_n} \sup_t |W_n(s_k, t)| \Bigm| P_n^n \times \mu^{k_n}\Bigr] \Rightarrow \mathcal{L}[\|W\|].$$

Theorem 3 follows. □


2.4. Extensions. The asymptotic theory of Theorems 1 to 3 can be extended 
in several directions. 


Estimation of risk. The risk $\rho_n(\hat P_n, n^{-1/2} t_n(\alpha, \hat P_n); P)$ of confidence set
$C_n(\alpha, \hat P_n)$ has the bootstrap estimate $\rho_n(\hat P_n, n^{-1/2} t_n(\alpha, \hat P_n); \hat P_n)$. Convergence in
probability of this estimate to the actual risk can be proved using the triangular
array reasoning of Theorem 2. A simpler risk estimate, which relies on the
asymptotic constancy (2.5) of $t_n(\alpha, \hat P_n)$ and on formula (2.9), is the mean of the
bootstrap distribution for $n^{1/2} d(\hat P_n, P)$ plus $t_n(\alpha, \hat P_n)$. This in turn can be
approximated by the mean of the bootstrap distribution for $n^{1/2} d_n(\hat P_n, P)$ plus
$u_n(\alpha, \hat P_n, s_n)$; see the proof of Theorem 3. In principle, risk estimates provide a
means for directly comparing confidence set $C_n(\alpha, \hat P_n)$ with other confidence sets
of the same asymptotic level.


Other roots for confidence sets. Alternative confidence sets for $P$ can be
obtained from the weighted metric

(2.16) $\sup_{s,t} \dfrac{|\hat P_n[A(s,t)] - P[A(s,t)]|}{\{P[A(s,t)](1 - P[A(s,t)])\}^{1/2}}$

of Anderson-Darling type. The supremum norm in (2.16) or in the definition of
$d(\hat P_n, P)$ may be replaced by other norms, such as the $L_p(m)$ norm on $S_d \times R$, $m$
being a finite measure. The asymptotic theory for confidence sets $C_n(\alpha, \hat P_n)$ and
$\check C_n(\alpha, \hat P_n)$ can be extended to the analogous confidence sets based on these
alternative roots. $L_p(m)$ norms with $\sigma$-finite $m$ can also be treated, at the price of
reducing the class $\mathscr{P}$ of possible distributions.


Confidence sets for the difference of two distributions. Let $P, Q$ be probabilities
in $\mathscr{P}$. Suppose $x_1, x_2, \ldots, x_{n_1}$ are i.i.d. ($P$) and $y_1, y_2, \ldots, y_{n_2}$ are i.i.d. ($Q$),
the two samples being independent. Let $\hat P_{n_1}$ and $\hat Q_{n_2}$ be the empirical distributions
of the $\{x_i: 1 \le i \le n_1\}$ and $\{y_j: 1 \le j \le n_2\}$, respectively, where $n = n_1 + n_2$ and
$n_1/n \to \lambda$ as $n \to \infty$, $0 < \lambda < 1$. Let $t_n(\alpha, P, Q)$ denote an upper $\alpha$-point of
$\mathscr{L}[n^{1/2}\|(\hat P_{n_1} - \hat Q_{n_2}) - (P - Q)\|_\infty \mid P^{n_1} \times Q^{n_2}]$. Define the confidence set

$C_n(\alpha, \hat P_{n_1}, \hat Q_{n_2}) = \{P - Q: \|(P - Q) - (\hat P_{n_1} - \hat Q_{n_2})\|_\infty \le n^{-1/2} t_n(\alpha, \hat P_{n_1}, \hat Q_{n_2}); P, Q \in \mathscr{P}\}$.

The development of this paper, including correct asymptotic level, LAM
optimality, and computationally feasible variants, can be carried through for this
confidence set.


3. Numerical study. Further insight into the applicability and performance
of confidence sets $C_n(\alpha, \hat P_n)$ and $\check C_n(\alpha, \hat P_n)$ is gained from three numerical case
studies.


3.1. First case study: five dimensional data. Mardia, Kent, and Bibby (1979)
reported test scores for $n = 88$ college students, each of whom took two closed
book and three open book tests. Marginally, the scores for each test appear to be
normally distributed (perhaps the consequence of a grading curve). Is it reasonable
to approximate the joint distribution of the five test scores by a normal
distribution?

The Monte Carlo approximation to the conditional bootstrap distribution for
$d_n(\hat P_n, P)$, given $s_n$, recorded in Table 1, was calculated from 200 bootstrap
samples, with $d_n$ defined by $k_n = 200$ randomly generated unit vectors in $R^5$.
Two points are noteworthy: (a) The bootstrap distribution of $d_n(\hat P_n, P)$ in this
example is necessarily supported on $\{j/88: 0 \le j \le 88\}$. The Monte Carlo
approximation in Table 1 is supported on $\{j/88: 8 \le j \le 22\}$, the cdf having
sizable jumps. Convergence of the conditional bootstrap distribution to its continuous
limit (Theorem 3) may not be swift. (b) The standard inequalities
for $P^n[d(\hat P_n, P) > c]$ are extremely conservative here. Both the
Vapnik-Červonenkis (1971) and Devroye (1982) inequalities yield the trivial
conclusion $P^n[d(\hat P_n, P) > 0.250] \le 1$, in contrast to Table 1. Alexander's (1984,
Theorem 2.11) inequality is not even applicable for $c < 8(88)^{-1/2} = 0.853$.
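The Monte Carlo calculation described above can be sketched as follows. This is a minimal illustration, not the authors' FORTRAN implementation. It assumes that $A(s,t)$ is the half-space $\{x: s'x \le t\}$, so that for each direction the supremum over $t$ is a Kolmogorov-Smirnov distance between projected empirical distributions, and that no two projections are tied. The function names and the default sizes are hypothetical.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalising Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def half_space_distance(counts, orders):
    """Max over directions of sup_t |P*[A(s,t)] - P_hat[A(s,t)]|.

    counts[i] is how often original point i occurs in the bootstrap sample;
    each entry of orders lists the point indices sorted by one projection.
    Assumes untied projections (true w.p. 1 for continuous data).
    """
    n = len(counts)
    worst = 0
    for order in orders:
        running = 0  # n * (bootstrap cdf - empirical cdf) along this direction
        for i in order:
            running += counts[i] - 1
            worst = max(worst, abs(running))
    return worst / n

def bootstrap_distances(data, n_directions=50, n_boot=100, seed=0):
    """Sorted bootstrap replicates of d_n, conditional on the drawn directions."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    orders = []
    for _ in range(n_directions):
        s = random_unit_vector(dim, rng)
        proj = [sum(a * b for a, b in zip(s, x)) for x in data]
        orders.append(sorted(range(n), key=proj.__getitem__))
    replicates = []
    for _ in range(n_boot):
        counts = [0] * n
        for _ in range(n):  # resample n points with replacement
            counts[rng.randrange(n)] += 1
        replicates.append(half_space_distance(counts, orders))
    return sorted(replicates)

def upper_point(sorted_values, level):
    """Smallest replicate whose empirical cdf is at least `level`."""
    idx = max(0, math.ceil(level * len(sorted_values)) - 1)
    return sorted_values[idx]
```

Because every replicate is a multiple of $1/n$, the resulting cdf is a step function on the lattice $\{j/n\}$, which is the discreteness noted in point (a).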

From Table 1, the confidence set

(3.1) $\check C_n(0.930, \hat P_n) = \{Q \in \mathscr{P}: d_n(Q, \hat P_n) \le 0.193\}$,

with $n = 88$, has approximate level 0.93. Let $\hat N_n$ denote the normal distribution
on $R^5$ whose mean and covariance matrix are given by the sample mean and
sample covariance matrix of the 88 test score vectors. For the $d_n$ of the previous
paragraph, $d_n(\hat P_n, \hat N_n) = 0.129$. The multivariate normal model for the test score


TABLE 1

Monte Carlo approximation to the bootstrap distribution of $d_n(\hat P_n, P)$, given $s_n$, for the
five-dimensional test score data. The number of random directions used is 200; the number
of bootstrap samples drawn is 200.

x                    0.091  0.102  0.114  0.125  0.136  0.148  0.159  0.170
Bootstrap cdf at x   0.010  0.025  0.080  0.185  0.345  0.525  0.685  0.805

x                    0.182  0.193  0.205  0.216  0.227  0.238  0.250
Bootstrap cdf at x   0.885  0.930  0.970  0.980  0.985  0.994  1.000


vectors appears satisfactory, in the sense that the trustworthy set
$\check C_n(0.930, \hat P_n)$ for $P$ contains at least one normal distribution, namely $\hat N_n$. (The
reasoning here is not that of a classical goodness-of-fit test.)

Computations for this example, in FORTRAN 77, using Kolmogorov-Smirnov
subroutines and pseudorandom numbers provided by IMSL, took approximately
one hour on a VAX 11/750.


3.2. Second case study: directional data. Steinmetz (1962) reported $n = 20$
cross-bed measurements of azimuth and dip from sandstone bodies in the Eocene
Cathedral Bluffs Member of the Wasatch formation in Wyoming. These directional
measurements can be represented as unit vectors in $R^3$ or as points on the
surface of a unit sphere. Is it reasonable to regard the data as a sample from a
Fisher distribution? A Schmidt-net plot (Mardia, 1972, page 212) indicates that
the observations form a sausage-like cluster on the surface of the unit sphere, a
configuration suspiciously inconsistent with the radial symmetry of the Fisher
density.

Let $\mathscr{P}$ denote the set of all distributions which are supported on $S_3$, the unit
sphere in $R^3$. Suppose the 20 observations are a random sample from an
unknown distribution $P \in \mathscr{P}$. The asymptotic theory of Section 2 remains valid
if $\mathscr{A}$ is replaced by the intersection of $S_3$ with all half-spaces of $R^3$, that is, the
collection of all caps on $S_3$. The Monte Carlo approximation to the
bootstrap distribution for $d_n(\hat P_n, P)$, given $s_n$, recorded in Table 2, was calculated
from 200 bootstrap samples, with $d_n$ defined by $k_n = 200$ randomly
generated unit vectors in $R^3$. In contrast, the Vapnik-Červonenkis and Devroye
inequalities yield the trivial bound $P^n[d(\hat P_n, P) > 0.45] \le 1$; Alexander's
inequality is not applicable at values in the support $\{j/20: 3 \le j \le 9\}$ of the
bootstrap distribution.

The confidence set

(3.2) $\check C_n(0.985, \hat P_n) = \{Q \in \mathscr{P}: d_n(Q, \hat P_n) \le 0.35\}$,

with $n = 20$, has approximate level 0.985. If $\hat F_n$ denotes the Fisher distribution
fitted to the sample by maximum likelihood, then $d_n(\hat P_n, \hat F_n)$ = [value illegible]. The
distance was approximated by first replacing $\hat F_n$ with the empirical distribution
of a Monte Carlo sample of size 1000 drawn from $\hat F_n$. $\hat F_n$ does not lie
in $\check C_n(0.985, \hat P_n)$. There might be some other Fisher distribution which does. Settling
this question conclusively, by computing the minimum value of $d_n(\hat P_n, F)$ as $F$
ranges over all Fisher distributions, appears very time-consuming.


TABLE 2

Monte Carlo approximation to the bootstrap distribution of $d_n(\hat P_n, P)$, given $s_n$, for the
directional data. The number of random directions used is 200; the number of bootstrap
samples drawn is 200.

x                    0.15         0.20         0.25     0.30   0.35   0.40   0.45
Bootstrap cdf at x   [illegible]  [illegible]  0.68[?]  0.890  0.985  0.995  1.000


3.3. Simulation study: multinomial data. A sample $z = (z_1, z_2, \ldots, z_q)$ from
the multinomial $(n; p_1, p_2, \ldots, p_q)$ distribution can be thought to arise from $n$
independent multivariate Bernoulli trials as follows. At each trial, one of $q$
possible outcomes occurs, the probability of outcome $j$ being $p_j$. The random
variable $z_j$ is the number of times outcome $j$ occurs in the $n$ trials.

Suppose possible outcome $j$ is represented as the vector $e_j$ in $R^q$ whose $j$th
component is 1 and whose other components are all 0. Then, the outcome of the
$i$th multivariate Bernoulli trial is a random vector $x_i$ whose distribution $P$ is
supported on the subset $E = \{e_j: 1 \le j \le q\}$ of $R^q$ and is given by $P[\{e_j\}] = p_j$,
$1 \le j \le q$. Moreover, $z = \sum_{i=1}^n x_i$, and the empirical distribution $\hat P_n$ of the
$\{x_i: 1 \le i \le n\}$ is supported on $E$, with $\hat P_n[\{e_j\}] = \hat p_j = z_j/n$, which is the
relative frequency of outcome $j$.

In this situation the half-space distance $d(\hat P_n, P)$ is equal to the total variation
norm distance between $\hat P_n$ and $P$. Consequently, $d(\hat P_n, P) = 2^{-1} \sum_{j=1}^q |\hat p_j - p_j|$.
This algebraic simplification provides an opportunity to compare the actual level
of confidence set $C_n(\alpha, \hat P_n)$, defined in (1.4), with the asymptotic level $1 - \alpha$. Table
3 reports the results of a Monte Carlo study for five-dimensional multinomial
random vectors with $n$ = 20, 40, and a third sample size that is illegible in the source. For each
vector of outcome probabilities $\{p_j: 1 \le j \le 5\}$ considered, 1000 multinomial
$(n; p_1, \ldots, p_5)$ vectors $\{z(m): 1 \le m \le 1000\}$ were generated. For each such sample
vector $z(m)$, the estimated outcome probabilities $\{\hat p_j(m): 1 \le j \le 5\}$ were calculated and 200
vectors were drawn from the multinomial $(n; \hat p_1(m), \ldots, \hat p_5(m))$ distribution,
in order to build up a bootstrap distribution for the $L_1$-distance
$2^{-1} \sum_j |\hat p_j^* - \hat p_j(m)|$ and so determine the associated confidence ball for the
$\{p_j: 1 \le j \le 5\}$.


TABLE 3

Observed levels in 1000 Monte Carlo trials of the bootstrap confidence set $C_n(\alpha, \hat P_n)$ for the
multinomial model of dimension 5. The number of bootstrap samples used to construct each
confidence set is 200.

[Table entries illegible in the source: four rows of outcome probabilities $(p_j)$, each with
observed levels at three nominal levels for each of three sample sizes.]


As might be expected, the agreement in Table 3 between nominal and actual
levels of the confidence set $C_n(\alpha, \hat P_n)$ is best when the probabilities $\{p_j: 1 \le j \le 5\}$
are all equal. Even when one or more of the $\{p_j\}$ is very small, the convergence of
actual level to nominal level as sample size $n$ increases is evident. Equally
noteworthy is the apparent tendency of the nominal level to exceed its actual


level of the confidence regions (34 cases out of 36 in Table 3). It remains to be 
seen if bootstrap critical values can be refined to reduce this effect. 
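The multinomial simulation of Section 3.3 can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' program: the distance is half the $L_1$ distance between probability vectors, the bootstrap radius is taken as the smallest replicate whose empirical cdf reaches the nominal level, and the trial and bootstrap counts default to smaller values than the 1000 and 200 used in the paper so that it runs quickly. All function names are hypothetical.

```python
import math
import random

def draw_multinomial(n, probs, rng):
    """Outcome counts in n independent trials with the given probabilities."""
    counts = [0] * len(probs)
    for j in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[j] += 1
    return counts

def half_l1(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def bootstrap_ball_covers(n, probs, level, n_boot, rng):
    """One Monte Carlo trial: does the bootstrap ball around p_hat contain probs?"""
    p_hat = [z / n for z in draw_multinomial(n, probs, rng)]
    boot = sorted(
        half_l1([z / n for z in draw_multinomial(n, p_hat, rng)], p_hat)
        for _ in range(n_boot)
    )
    k = max(1, math.ceil(level * n_boot - 1e-9))  # order statistic for the radius
    radius = boot[k - 1]
    return half_l1(p_hat, probs) <= radius + 1e-12

def observed_level(n, probs, level, trials=200, n_boot=100, seed=0):
    """Fraction of trials in which the true probability vector is covered."""
    rng = random.Random(seed)
    hits = sum(bootstrap_ball_covers(n, probs, level, n_boot, rng)
               for _ in range(trials))
    return hits / trials
```

For example, `observed_level(20, [0.2] * 5, 0.90)` estimates the actual level at nominal level 0.90 for equal outcome probabilities; the estimate carries Monte Carlo error of a few percentage points at these settings.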


4. Empirical process results. This section collects facts about the empirical
process $W_n$ and its limit distribution which are needed for the proofs in
Section 2.


4.1. Convergence of the empirical process $W_n$. Let $x_{1,n}, x_{2,n}, \ldots, x_{n,n}$ be i.i.d.
random vectors in $R^d$, each having distribution $P_n \in \mathscr{P}$. Let $\hat P_n$ be the empirical
measure of the $\{x_{i,n}: 1 \le i \le n\}$ and let $W_n(s,t) = n^{1/2}\{\hat P_n[A(s,t)] - P_n[A(s,t)]\}$.
Recall the norm $\|\cdot\|_\infty$ on $L_\infty$ defined after display (2.1).


PROPOSITION 1. Suppose $\lim_n \|P_n - P\|_\infty = 0$ for some distribution $P \in \mathscr{P}$.
The empirical processes $\{W_n; n \ge 1\}$ converge weakly, as random elements of
$L_\infty$, to a Gaussian process $W$ on $S_d \times R$ having mean zero and covariance
function

(4.1) $E[W(s,t)W(s',t')] = P[A(s,t) \cap A(s',t')] - P[A(s,t)]P[A(s',t')]$

for $(s,t), (s',t')$ in $S_d \times R$.


This triangular array weak convergence result may be deduced from Le Cam
(1983). Le Cam's Lemma 4, together with his analysis of M(F, 2) at the bottom
of p. 317, implies the equicontinuity property: For every $\varepsilon > 0$ and $\eta > 0$ there
exists $\gamma > 0$ such that

(4.2) $\limsup_{n\to\infty} P_n^n[\sup_{G(n,\gamma)} |W_n(s,t) - W_n(s',t')| > \eta] < \varepsilon$,

where

(4.3) $G(n,\gamma) = \{(s,t), (s',t') \in S_d \times R: P_n[A(s,t) \,\Delta\, A(s',t')] < \gamma\}$.

(An elementary argument involving the definition of Le Cam's 2, converts the
supremum in Le Cam's lemma to ours.) The convergence $\lim_{n\to\infty} \|P_n - P\|_\infty = 0$
permits replacement of $G(n,\gamma)$ in (4.2) by

(4.4) $G(\infty,\gamma) = \{(s,t), (s',t') \in S_d \times R: P[A(s,t) \,\Delta\, A(s',t')] < \gamma\}$.

This yields the classical criterion for tightness of the processes $\{W_n: n \ge 1\}$ in
$L_\infty$. The proposition follows immediately.


4.2. The maximum of a Gaussian process. The proofs of Theorems 2 and 3
required the fact that the random variable $\|W\|$ has a strictly increasing
continuous cdf. This fact follows at once from the more general Proposition 2,
which is useful in the analysis of other bootstrap procedures as well.


PROPOSITION 2. Let $(\tau, H, B)$ be an abstract Wiener space with $P_0$ the
canonical normal distribution on $B$. If $|\cdot|$ is the norm of $B$, then under $P_0$ the
random variable $z \to |z|$, defined on $B$, has a density and a strictly increasing
cdf on $[0, \infty)$.


PROOF. The result will be deduced from a theorem of Tsirel'son (1975).

Let $B^*$ be the dual of $B$. By the Hahn-Banach theorem, $|z| = \sup m(z)$,
where the supremum is computed over $m \in B^*$, $|m| = 1$. Since $B$ is separable,
this supremum can be computed over a countable set of $m$. Since each random
variable $z \to m(z)$ is Gaussian under $P_0$, we therefore may analyze $|z|$ as the
maximum of a countable collection of mean zero Gaussian random variables.

According to Tsirel'son (Theorem 1), $|z|$ has a continuous distribution except
possibly at the point $a_0 = \inf\{a: P_0\{|z| \le a\} > 0\}$. Let us show first that the
distribution of $|z|$ has no atom at $a = 0$. Since $|z| = 0$ iff $z = 0$, it suffices to
show that $P_0$ has no atom at $0 \in B$. Let $\{P_h, h \in H\}$ be the Gaussian shift
family for $(\tau, H, B)$, so that $P_h(A) = P_0(A - \tau h)$. If $P_0$ had an atom at $0 \in B$,
then because of the mutual absolute continuity of the $P_h$, $P_0$ would have an
atom at each point $\tau h \in B$; this is an uncountable collection of atoms, which is
impossible. Thus $P_0$ has no atom at 0 (and by the same argument, none at any
other point).

On the other hand, every ball in $B$ centered at 0 must have positive $P_0$
probability; this is immediate from a straightforward extension of Anderson's
lemma (Anderson, 1955) to Gaussian measures on Banach space. By the previous
paragraph, the cdf of $|z|$ must be continuous everywhere. Since every ball about
0 has positive $P_0$ probability, the mutual absolute continuity of the $\{P_h\}$ shows
that every ball (center arbitrary) has positive probability. This implies that the
cdf of $|z|$ is strictly increasing. The existence of the density of $|z|$ follows from
another part of Tsirel'son's Theorem 1. This completes the proof. □


REMARK. Facts cited on page 854 of Tsirel'son's paper assert that the
density of $|z|$ is strictly positive at every point of $(0, \infty)$.


REFERENCES 


ALEXANDER, K. (1984). Probability inequalities for empirical processes and a law of the iterated
logarithm. Ann. Probab. 12 1041-1067.

ANDERSON, T. W. (1955). The integral of a symmetric unimodal function. Proc. Amer. Math. Soc. 6
170-176.

BERAN, R. (1984). Bootstrap methods in statistics. Jber. Deutsch. Math.-Verein. 86 14-30.

BERAN, R. J. and MILLAR, P. W. (1985). Asymptotic theory of confidence sets. In Proc. Berkeley
Conf. in Honor of J. Neyman and J. Kiefer 2 (L. Le Cam and R. Olshen, eds.) 865-887.
Wadsworth, Monterey, Calif.

CRAMÉR, H. and WOLD, H. (1936). Some theorems on distribution functions. J. London Math. Soc.
11 290-294.

DEVROYE, L. (1982). Bounds for the uniform deviation of empirical measures. J. Multivariate Anal.
12 72-79.

DIACONIS, P. and FREEDMAN, D. (1984). Asymptotics of graphical projection pursuit. Ann. Statist.
12 793-815.

DUDLEY, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6 899-929.

EFRON, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7 1-26.

HUBER, P. J. (1985). Projection pursuit. Ann. Statist. 13 435-475.

LE CAM, L. (1983). A remark on empirical measures. In Festschrift for Erich Lehmann (P. J. Bickel,
K. Doksum, and J. L. Hodges, eds.) Wadsworth, Belmont, Calif.

MARDIA, K. V. (1972). Statistics of Directional Data. Academic, New York.

MARDIA, K. V., KENT, J. T. and BIBBY, J. M. (1979). Multivariate Analysis. Academic, New York.

MILLAR, P. W. (1983). The minimax principle in asymptotic statistical theory. Lecture Notes in
Mathematics 976 75-265. Springer, Berlin.

STEINMETZ, R. (1962). Analysis of vectorial data. Jour. Sed. Petr. 32 801-812.

TSIREL'SON, V. S. (1975). The density of the distribution of the maximum of a Gaussian process.
Theory Probab. Appl. 20 847-856.

VAPNIK, V. N. and ČERVONENKIS, A. YU. (1971). On the uniform convergence of relative frequencies
of events to their probabilities. Theory Probab. Appl. 16 264-280.

WICHURA, M. J. (1970). On the construction of almost uniformly convergent random variables with
given weakly convergent image laws. Ann. Math. Statist. 41 284-291.

WOLFOWITZ, J. (1954). Generalization of the theorem of Glivenko-Cantelli. Ann. Math. Statist. 25
131-138.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF CALIFORNIA 
BERKELEY, CALIFORNIA 94720 


The Annals of Statistics
1986, Vol. 14, No. 2, 444-460


IMPROVED CONFIDENCE SETS FOR THE COEFFICIENTS OF 
A LINEAR MODEL WITH SPHERICALLY SYMMETRIC 
ERRORS 


BY JIUNN TZON HWANG¹ AND JEESEN CHEN


Cornell University and University of Cincinnati


Under the assumption of normal errors, confidence spheres for $p$ ($p \ge 3$)
coefficients of a linear model centered at the positive part James-Stein
estimators were recently proved, by Hwang and Casella, to dominate the
usual confidence set with the same radius. In this paper, the same domination
results are established under various spherically symmetric distributions.
These distributions include uniform distributions, double exponential distributions,
and multivariate $t$ distributions.


1. Introduction. For a standard linear model

(1.1) $\underset{n \times 1}{X} = \underset{n \times p}{A}\,\underset{p \times 1}{\theta} + \sigma \underset{n \times 1}{\varepsilon}$,

assume that the design matrix $A$ has full rank $p$, $\varepsilon$ has an $n$-variate normal
distribution $N(0, I)$, and hence $\sigma\varepsilon \sim N(0, \sigma^2 I)$. For the confidence set problem,
the unequal variance case can be transformed to this equal variance case
when the ratios of variances are known. We focus here on the simpler situation
where the variance $\sigma^2$ is known. The usual $1 - \alpha$ confidence set for $\theta$ is

(1.2) $C_{X,\sigma} = \{\theta: (\theta - \hat\theta)'(A'A)(\theta - \hat\theta) \le c^2\sigma^2\}$,

where $\hat\theta = (A'A)^{-1}A'X$ is the least squares estimator, and $c^2$ is chosen so that
$P(\chi_p^2 \le c^2) = 1 - \alpha$.

Even though $C_{X,\sigma}$ enjoys many optimal properties (i.e., best invariant; minimax;
admissible for $p \le 2$), it is inadmissible when $p \ge 3$. Thus there exists a
confidence set that dominates $C_{X,\sigma}$, i.e., which has the same volume but has
higher coverage probabilities than $C_{X,\sigma}$ for every $\theta$. In fact, the James-Stein
confidence sets $C_{JS,\sigma}$ are such confidence sets, where

(1.3) $C_{JS,\sigma} = \{\theta: (\theta - \delta_{JS})'(A'A)(\theta - \delta_{JS}) \le c^2\sigma^2\}$

and

(1.4) $\delta_{JS} = \theta_0 + \left[1 - \dfrac{a\sigma^2}{(\hat\theta - \theta_0)'(A'A)(\hat\theta - \theta_0)}\right]^+ (\hat\theta - \theta_0)$.

The point estimator $\delta_{JS}$, with $a$ being some positive number, is the positive part


Received March 1983; revised October 1985.

¹Research supported by NSF Grant Number MCS-8003568.

AMS 1980 subject classifications. Primary 62F25, 62F35; secondary 62J07, 62C20.

Key words and phrases. Coverage probability, uniform distribution, double exponential distribution,
multivariate $t$ distribution, spherically symmetric distributions.


James-Stein estimator (1961) shrinking toward a prior guess $\theta_0$ of $\theta$. The
existence of confidence sets dominating $C_{X,\sigma}$ was proved independently by
Brown (1966) and Joshi (1967). Constructive results were given in Hwang and
Casella (1984), which imply domination for any positive $a$ less than a specific
bound. The upper bound, which is derived analytically, depends on $c^2$ and $p$ but is
about $0.8(p - 2)$ when $C_{X,\sigma}$ has 0.9 coverage probability and $p = 20$. The
largest $a$ for which domination holds is analytically proved to be no greater
than $2(p - 2)$ but is close to $2(p - 2)$ according to their numerical study.
The numerical study in an earlier article by Hwang and Casella (1982) shows
that the largest improvement in probability (which occurs at $\theta = \theta_0$) can be
substantial. The history of this problem was discussed therein.

In this paper, we consider the same model as in (1.1), except that $\varepsilon$ is assumed
to have a spherically symmetric distribution. That is,

(1.5) $X = A\theta + \sigma\varepsilon$, $\varepsilon \sim f(|\varepsilon|^2)$ p.d.f.

(Normally the variance of any component of $\varepsilon$ is taken to be 1.) We establish
results for a general $f$ with special emphasis on $t$ distributions, double exponential
distributions, and spherical uniform distributions. For a $t$ distribution (with
$N$ degrees of freedom),

(1.6) $f(|\varepsilon|^2) = \text{constant} \cdot \left(1 + \dfrac{1}{N}|\varepsilon|^2\right)^{-(N+p)/2}$;

for an exponential distribution (with parameter $k$),

(1.7) $f(|\varepsilon|^2) = \text{constant} \cdot e^{-k|\varepsilon|}$.

The specific probability density function of the spherical uniform distribution is
given in (2.6). Model (1.5) has been considered in the literature. In particular,
when $\varepsilon$ has a $t$ distribution, it was used by authors cited in Zellner (1976) to
model some practical situations. Under model (1.5), the usual least squares
estimator and its corresponding tests were studied to some extent by Thomas
(1970), Zellner (1976), Box (1952 and 1953), and others. See Chmielewski
(1981) for an excellent survey.

Note that, for virtually any distribution having $\theta$ as a location parameter, Brown
(1966) has proved, for $p \ge 3$, the existence of confidence sets dominating $C_{X,\sigma}$.
However, no constructive results have been obtained.

In this paper, we prove under (1.5) the domination of $C_{JS,\sigma}$ over $C_{X,\sigma}$ for $a$
less than a given bound. The upper bound, in general, depends on the underlying
distribution. However, for many classes of flat tailed distributions that include
multivariate $t$ distributions, double exponential distributions, and uniform distributions,
the corresponding upper bounds are greater than or approximately equal to
$a_N$, the upper bound for the normal case. Therefore, the superiority of $C_{JS,\sigma}$
over $C_{X,\sigma}$, as proved by Hwang and Casella (1984) for the normal case, has been
broadened to these classes of distributions.

In Section 2, we prove a general theorem (Theorem 2.2) that relates the
domination result of $C_{JS,\sigma}$ over $C_{X,\sigma}$ to a measure of flatness of the assumed
distribution. See also Corollary 5.2. Sections 3 and 4 contain some stronger
results for the $t$ distributions and double exponential distributions. Section 5
provides conditions on $a$ which are necessary and are nearly sufficient for the
domination of $C_{JS,\sigma}$ over $C_{X,\sigma}$. This is based on the study of the coverage
probability of $C_{JS,\sigma}$ when $|\theta|$ is large. In many cases, the largest possible $a$ for
domination is at least $2(p - 2)$.

Note that our theorems also establish the minimaxity of $C_{JS,\sigma}$ for a wide class
of spherically symmetric distributions, since $C_{JS,\sigma}$ is superior to $C_{X,\sigma}$, which is
proved to be minimax by Hooper (1982).

Previously, in the context of point estimation, point estimators of the
form $\delta_{JS}$ have been shown in Strawderman (1974), Berger (1975), Brandwein
(1979), and Brandwein and Strawderman (1978 and 1980) to dominate $\hat\theta$ for
various spherically symmetric distributions and various losses. Our results here
can also be considered to be their counterpart, namely the establishment of
improved confidence sets associated with their point estimators.


2. Domination results and their implications. To study the model (1.5),
we can assume without loss of generality that it has a canonical form representation,
i.e., $A = \binom{A_1}{0}$, where $A_1$ is a $p \times p$ nonsingular matrix. Furthermore, we
apply a linear transformation

$\hat\theta \to \hat\theta^* = (A_1'A_1)^{1/2}(\hat\theta - \theta_0)$ and $\theta \to \theta^* = (A_1'A_1)^{1/2}(\theta - \theta_0)$

and note that $\hat\theta^*$ has the same distribution as $\sigma\varepsilon_p + \theta^*$, where $\varepsilon_p$ is the vector of
the first $p$ components of $\varepsilon$. The distribution of $\varepsilon_p$ is also spherical. Even though
the p.d.f. of $\varepsilon_p$ is obviously different from that of $\varepsilon$, it will be denoted as $f(\cdot)$
below. However, if $\varepsilon$ has an $n$-dimensional $t$ distribution with $N$ degrees of
freedom, then $\varepsilon_p$ has a $p$-dimensional $t$ distribution with the same degrees of freedom $N$. By
using the new variables $\hat\theta^*$ and $\theta^*$, we can now assume without loss of generality
that model (1.5) is

$X_p = \theta + \sigma\varepsilon_p$,

where $X_p$ is the first $p$ components of $X$. Also, since $\sigma^2$ is known, we can assume
without loss of generality that it is 1. Suppressing the subscripts $p$ in $X_p$ and $\varepsilon_p$
leads us to write the model (1.5) as

(2.1) $X = \theta + \varepsilon$, $\varepsilon \sim f(|\varepsilon|^2)$ p.d.f.

Now $C_{X,\sigma}$ and $C_{JS,\sigma}$ are reduced, respectively, to

(2.2) $C_X = \{\theta: |\theta - X|^2 \le c^2\}$

and

(2.3) $C_{\delta_a} = \{\theta: |\theta - \delta_a(X)|^2 \le c^2\}$,

where

(2.4) $\delta_a(X) = \left(1 - \dfrac{a}{|X|^2}\right)^+ X$.
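In the canonical form, the usual set $C_X$ is the ball of radius $c$ centered at $X$, and the James-Stein set is the ball of the same radius centered at the positive part estimator $\delta_a(X) = (1 - a/|X|^2)^+ X$. The following minimal sketch illustrates the two sets; the function names are hypothetical, and returning the zero vector when $X = 0$ is a convention adopted here only to avoid division by zero.

```python
def positive_part_js(x, a):
    """Positive part James-Stein estimator (1 - a/|x|^2)^+ x in canonical form."""
    norm_sq = sum(c * c for c in x)
    if norm_sq == 0.0:
        return [0.0 for _ in x]  # convention for x = 0 (an assumption)
    factor = max(0.0, 1.0 - a / norm_sq)
    return [factor * c for c in x]

def in_usual_set(theta, x, c):
    """theta in C_X = {theta: |theta - x|^2 <= c^2}."""
    return sum((t - xi) ** 2 for t, xi in zip(theta, x)) <= c * c

def in_js_set(theta, x, c, a):
    """theta in the recentered ball {theta: |theta - delta_a(x)|^2 <= c^2}."""
    return in_usual_set(theta, positive_part_js(x, a), c)
```

Both sets have the same volume because they are balls of the same radius; only the center differs. When $|X|^2 \le a$ the estimator collapses to the origin, so the recentered ball covers $\theta = 0$ even when the usual ball does not.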


Note that in all the examples considered in this paper, $f$ is decreasing.
(Throughout the paper, decreasing means nonincreasing. Similarly, increasing
is equivalent to nondecreasing.) Hence $C_X$ is best among the invariant set
estimators in the sense that $C_X$ has minimum volume among the invariant set
estimators with coverage probability at least $1 - \alpha = P_\theta(\theta \in C_X)$. Furthermore,
$C_X$ is minimax. That is, $C_X$ minimizes the maximum of the volumes among all
the set estimators with confidence coefficient (or minimum coverage probability)
at least $1 - \alpha$. See Hooper (1982, Theorems 1 and 2).

In this section, we will develop theorems that relate the domination of $C_{\delta_a}$
over $C_X$ to a quantity that measures the flatness of $f$. Readers more interested in
$t$ or exponentially distributed errors may read the next sections first without
much difficulty.


DEFINITION 2.1. The quantity $f'(s)/f(s)$, when it is defined, is called the
relative increasing rate (RIR) of $f$ at $s$.


The RIR of $f$ measures the rate of increase of $f$ relative to $f$ and is usually
negative. If $f$ has a large RIR, $f$ dies out to zero slowly and consequently $f$ has
a heavy tail. On the other hand, if the RIR of $f$ is very small (or very negative), $f$ has a
sharp tail that dies out to zero quickly.

In the special case that $X \sim N(\theta, \sigma^2 I)$, the RIR is a constant function equal
to $-(2\sigma^2)^{-1}$. The following theorem states the domination results and will be
proved at the end of this section.
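As an illustration of Definition 2.1, the RIR can be written in closed form for the densities emphasized in this paper and checked against a numerical derivative of $\ln f$. The sketch below treats $f$ as a function of $s = |\varepsilon|^2$ and drops normalizing constants, which cancel in $f'(s)/f(s)$. The double exponential form $f(s) = e^{-k\sqrt{s}}$ is an assumption about how (1.7) is parameterized, and the function names are hypothetical.

```python
import math

def rir_numeric(f, s, h=1e-6):
    """Central-difference approximation to f'(s)/f(s) = d/ds ln f(s)."""
    return (math.log(f(s + h)) - math.log(f(s - h))) / (2.0 * h)

def normal_f(s, sigma_sq=1.0):
    return math.exp(-s / (2.0 * sigma_sq))

def t_f(s, N, p):
    return (1.0 + s / N) ** (-(N + p) / 2.0)

def double_exp_f(s, k):
    return math.exp(-k * math.sqrt(s))  # assumed reading of (1.7)

# Closed-form RIRs obtained by differentiating ln f.
def normal_rir(s, sigma_sq=1.0):
    return -1.0 / (2.0 * sigma_sq)           # constant in s

def t_rir(s, N, p):
    return -(N + p) / (2.0 * (N + s))        # rises toward 0 as s grows

def double_exp_rir(s, k):
    return -k / (2.0 * math.sqrt(s))         # rises toward 0 as s grows
```

For the $t$ and double exponential densities the RIR increases toward zero as $s$ grows, which is the heavy-tail behavior described above, whereas the normal RIR stays constant.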


THEOREM 2.2. Assume that the RIR of $f(s)$ is defined for every $s$, $\alpha_0 \le s \le \alpha_1$,
where

$\alpha_0 = [(c - \sqrt{a})^+]^2$ and $\alpha_1 = c^2 + a$.

If $a > 0$ is such that

(2.5) $\inf_{\alpha_0 \le s \le \alpha_1} \dfrac{f'(s)}{f(s)} \ge \dfrac{-(p - 2)}{2c\sqrt{a}} \ln\left[\dfrac{c + \sqrt{c^2 + a}}{\sqrt{a}}\right]$,

then the coverage probability of $C_{\delta_a}$ is higher than that of $C_X$ for every $\theta$. Since $C_{\delta_a}$ has
the same volume as $C_X$, $C_{\delta_a}$ dominates $C_X$.


The following corollary gives some insight about the $a$'s that satisfy (2.5). The
proof is straightforward and is omitted.


COROLLARY 2.3. When $p > 2$, the solutions of $a$ to the inequality (2.5) form
an interval $(0, a_0]$ where $a_0 > 0$. If the left-hand side of (2.5) is continuous in $a$,
then $a_0$ is the unique solution to (2.5) with the inequality replaced by an equality.


When $X \sim N(\theta, \sigma^2 I)$, the left-hand side of (2.5) is then the RIR, $-(2\sigma^2)^{-1}$.
Theorem 2.2 and Corollary 2.3 then reduce to Theorem 2.1 of Hwang and Casella
(1984), which is stronger than the domination result in Hwang and Casella (1982).
Using a programmable calculator, one can calculate the upper bound $a_0$. For the
normal case with $\sigma^2 = 1$, the numerical values of $a_0$, denoted as $a_N$, were
reported for selected values of $c^2$ and $p$ in Hwang and Casella (1984) and are
reported in Table 1 ($N = \infty$) for convenience of further discussions.
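For the normal case with $\sigma^2 = 1$ the RIR is $-1/2$, so condition (2.5) reads $c\sqrt{a} \le (p - 2)\ln[(c + \sqrt{c^2 + a})/\sqrt{a}]$, and the bound $a_N$ is the value of $a$ at which equality holds. The sketch below solves this equation by bisection. It is an illustration, not the procedure used by the authors; the function name is hypothetical, and the $c^2$ values in the usage note are standard 0.90 quantiles of the chi-square distribution supplied by hand.

```python
import math

def normal_domination_bound(p, c_sq, tol=1e-10):
    """Solve c*sqrt(a) = (p - 2) * ln((c + sqrt(c^2 + a)) / sqrt(a)) for a > 0."""
    if p <= 2:
        raise ValueError("the bound is positive only for p > 2")
    c = math.sqrt(c_sq)

    def gap(a):
        # Increasing in a: negative while (2.5) holds, positive once it fails.
        root_a = math.sqrt(a)
        t_star = (c + math.sqrt(c_sq + a)) / root_a
        return c * root_a - (p - 2) * math.log(t_star)

    lo, hi = 1e-12, 1.0
    while gap(hi) < 0.0:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:   # bisection
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With $c^2 = 6.2514$ for $p = 3$ and $c^2 = 9.2364$ for $p = 5$, the computed bounds are close to the $N = \infty$ entries 0.580 and 2.132 of Table 1.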

Theorem 2.2 and Corollary 2.3 apply to virtually any spherically symmetric
distribution. In applying these theorems, the calculations are usually straightforward.
See Hwang and Chen (1983) for results concerning other distributions.


TABLE 1

Values of bounds $a_0$ for domination under the multivariate $t$ distribution
with $N$ degrees of freedom. For convenience of comparing
to the normal case, $c^2$ were chosen so that $P(\chi_p^2 \le c^2) = 0.90$.
Values of $c^2$ are given in Table 2.

N     p=3     p=5     p=7     p=9     p=11    p=15    p=20    p=25
1     0.866   1.887   2.702   3.438   4.132   5.449   7.018   8.539
3     0.792   2.081   3.003   3.811   4.556   5.940   7.560   9.112
5     0.746   2.220   3.240   4.123   4.924   6.385   8.065   9.657
7     0.716   2.324   3.431   4.384   5.242   6.787   8.535   10.172
9     0.696   2.401   3.586   4.605   5.519   7.150   8.971   10.658
10    0.687   2.396   3.653   4.703   5.644   7.318   9.117   10.890
15    0.659   2.345   3.913   5.098   6.162   8.043   10.097  11.954
20    0.642   2.310   4.019   5.380   6.547   8.616   10.863  12.870
25    0.632   2.284   3.9[?]  5.591   6.843   9.076   11.503  13.661
35    0.618   2.250   3.947   5.652   7.265   9.764   12.508  14.946
45    0.614   2.228   3.917   5.621   7.323   10.251  13.252  15.937
55    0.605   2.213   3.896   5.596   7.229   10.613  13.824  16.718
90    0.596   2.184   3.850   5.040   7.240   10.645  14.882  18.489
∞     0.580   2.132   3.760   5.413   7.079   10.434  14.653  18.890


For double exponential distributions and $t$ distributions, the $a_0$ are small.
Fortunately, these upper bounds are enlarged in Sections 3 and 4. Even though
Theorem 2.2 is weak for these special cases, it reveals clearly the relationship
between the domination result and the RIR. For another perhaps more significant
connection, see the first paragraph after Corollary 5.2.

Theorem 2.2 asserts that if the RIR of $f$ is uniformly bounded below by a
certain bound, depending on $p$, $c$, and $a$, then $C_{\delta_a}$ dominates $C_X$. Therefore, the
message is clear: Stein's set estimator dominates $C_X$ if the tail is heavy enough.
This is probably due to the fact that the James-Stein estimator is a shrinkage
estimator.

We can also apply Theorem 2.2 to a spherical uniform distribution.


COROLLARY 2.4 ($p \ge 2$) (Uniform distribution over a sphere centered at the
origin with known radius). Suppose that the p.d.f. of $\varepsilon = X - \theta$ is

(2.6) $f(|\varepsilon|^2) = \text{constant}$ if $|\varepsilon| \le R$,
      $f(|\varepsilon|^2) = 0$ otherwise.

Then $C_{\delta_a}$ dominates $C_X$ if $0 < a < R^2 - c^2$.


In deriving Corollary 2.4, note that (2.5) is automatically satisfied for $p \ge 2$ as
long as the RIR is well defined. This is equivalent to $a < R^2 - c^2$. Also note that
if the true distribution is uniform and if $C_X$ has coverage probability $1 - \alpha < 1$,
then $c^2 < R^2$ and hence the condition on $a$ is not vacuous. A striking feature is
that even for $p = 2$, $C_X$ can be improved by Corollary 2.4. This is not very
surprising in light of the fact that the best location invariant set estimator is not
unique, which implies that $C_X$ can be uniformly improved even for $p = 1$, as
shown in Farrell (1964).

The remainder of this section will be devoted to the proof of Theorem 2.2. We
will need the following two lemmas, which will also be useful in dealing with the $t$
distributions and double exponential distributions in Sections 3 and 4.

Assume as in (2.1) that the p.d.f. of $X$ is $f(|x - \theta|^2)$. To prove the domination
of $C_{\delta_a}$ over $C_X$, we follow the technique developed in Hwang and Casella (1984).
We consider two regions of $\theta$: $|\theta| \le c$ and $|\theta| > c$. For the first case, we have the
following lemma.


LEMMA 2.5. For $|\theta| \le c$ and $a > 0$,

(2.7) $P_\theta(\theta \in C_{\delta_a}) \ge P_\theta(\theta \in C_X)$.


PROOF. The proof is similar to the proof of Theorem 2.1 in Hwang and
Casella (1982) and is hence omitted. □


Below we need only focus on the situation $|\theta| > c$. For such a region a formula
for $(\partial/\partial a) P_\theta(\theta \in C_{\delta_a})$ is established in Lemma 2.6. Note that the domination

results can be proved for $a \in (0, a_0]$ if one can show that for every $a \in (0, a_0]$,

$\dfrac{\partial}{\partial a} P_\theta(\theta \in C_{\delta_a}) \ge 0$,

since this implies that for all $a \in (0, a_0]$

$P_\theta(\theta \in C_{\delta_a}) \ge \lim_{a' \to 0} P_\theta(\theta \in C_{\delta_{a'}}) = P_\theta(\theta \in C_X)$,

due to the continuity of $P_\theta(\theta \in C_{\delta_a})$ as a function of $a$.

Define

$u(r) = \left(1 - \dfrac{a}{r^2}\right)^+$,

$\alpha(r) = \alpha(r, \beta) = r^2 - 2r|\theta|\cos\beta + |\theta|^2$,

(2.8) $Q = 2\pi \prod_{i=1}^{p-3} \int_0^\pi \sin^i(t)\,dt$,

and

$\beta_0 = \sin^{-1}\dfrac{c}{|\theta|}$.

Also let $r_\pm$ be the solutions to

(2.9) $r\,u(r) = |\theta|\cos\beta \pm \sqrt{c^2 - |\theta|^2\sin^2\beta} \overset{\mathrm{def}}{=} r_0^\pm$,

i.e.,

(2.10) $r_\pm(a, \theta, \beta) = \left(r_0^\pm + \sqrt{(r_0^\pm)^2 + 4a}\right)/2$.

Using these notations and a spherical transformation, one can write

(2.11) $P_\theta(\theta \in C_{\delta_a}) = Q \int_0^{\beta_0} \int_{r_-}^{r_+} r^{p-1} \sin^{p-2}\beta \, f(\alpha(r)) \, dr \, d\beta$.

Now the derivative formula of Hwang and Casella (1984) can be generalized to
this case. The straightforward proof, which is omitted, is based on interchanging
the order of differentiation and integration, and the fundamental theorem of
calculus.


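LEMMA 2.6. Assume that $|\theta| > c$ and that $f(\alpha(r))$ is a continuous function
on the set of $(r, \beta)$ such that $r_- \le r \le r_+$ and $0 \le \beta \le \beta_0$. Then

(2.12)   $$\frac{d}{da} P_\theta(\theta \in C_{\delta^a}) = Q \int_0^{\beta_0} m(a, \theta, \beta)\, d\beta,$$

where

(2.13)   $$m(a, \theta, \beta) = \sin^{p-2}\beta \left[ \frac{r_+^p f(\alpha(r_+))}{r_+^2 + a} - \frac{r_-^p f(\alpha(r_-))}{r_-^2 + a} \right].$$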
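The boundary terms in (2.13) come from the way the limits $r_\pm$ move with $a$. The step below is a sketch of that calculation (the paper omits the proof). It assumes $r_\pm > \sqrt{a}$, which holds for $|\theta| > c$, and uses the fact that neither $r_{0\pm}$, $\beta_0$, nor the integrand in (2.11) depends on $a$.

```latex
% On r > \sqrt{a}, equation (2.9) reads r - a/r = r_{0\pm}, with r_{0\pm} free of a.
\begin{align*}
\frac{\partial}{\partial a}\Bigl(r_\pm - \frac{a}{r_\pm}\Bigr) = 0
&\;\Longrightarrow\;
\frac{\partial r_\pm}{\partial a}\Bigl(1 + \frac{a}{r_\pm^{2}}\Bigr) = \frac{1}{r_\pm}
\;\Longrightarrow\;
\frac{\partial r_\pm}{\partial a} = \frac{r_\pm}{r_\pm^{2} + a}, \\
\frac{\partial}{\partial a}\int_{r_-}^{r_+} r^{p-1} f(\alpha(r))\,dr
&= \frac{r_+^{p}\, f(\alpha(r_+))}{r_+^{2} + a}
 - \frac{r_-^{p}\, f(\alpha(r_-))}{r_-^{2} + a}.
\end{align*}
```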




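PROOF OF THEOREM 2.2. By Lemma 2.6, we need only show that for all $\beta$,
$0 < \beta < \beta_0$, and $\theta$, $|\theta| > c$,

$$m(a, \theta, \beta) \ge 0,$$

which is clearly equivalent to

(2.14)   $$\left(\frac{r_+}{r_-}\right)^{p-2} \frac{f(\alpha(r_+))}{f(\alpha(r_-))} \ge \frac{1 + a r_+^{-2}}{1 + a r_-^{-2}}.$$

Since $r_+ > r_-$ for all $0 < \beta < \beta_0$, the right-hand side of (2.14) is less than 1,
so (2.14) could be established if one could show

(2.15)   $$\left(\frac{r_+}{r_-}\right)^{p-2} \frac{f(\alpha(r_+))}{f(\alpha(r_-))} \ge 1,$$

or equivalently,

(2.16)   $$g(\alpha(r_+)) - g(\alpha(r_-)) \ge -(p-2) \ln\frac{r_+}{r_-},$$

where $g(s) = \ln f(s)$. By the mean value theorem, $g(\alpha(r_+)) - g(\alpha(r_-))$ equals
$g'(s)(\alpha(r_+) - \alpha(r_-))$ for some number $s$ between $\alpha(r_-)$ and $\alpha(r_+)$. Now

(2.17)   $$\alpha(r_+) - \alpha(r_-) = (r_+ - r_-)(r_+ + r_- - 2|\theta|\cos\beta).$$

From (2.9),

$$2|\theta|\cos\beta = r_+ u(r_+) + r_- u(r_-).$$

Since (2.9) and (2.10) imply $r_+ > r_- > \sqrt{a}$,

$$u(r_\pm) = \left(1 - \frac{a}{r_\pm^2}\right)^+ = 1 - \frac{a}{r_\pm^2},$$

which, together with (2.9), implies that

$$2|\theta|\cos\beta = r_+ - \frac{a}{r_+} + r_- - \frac{a}{r_-}.$$

Substituting this expression for $2|\theta|\cos\beta$ in (2.17) shows that

(2.18)   $$\alpha(r_+) - \alpha(r_-) = a\left(\frac{r_+}{r_-} - \frac{r_-}{r_+}\right) > 0$$

and that (2.16) is equivalent to

$$a\,g'(s)\left(t - \frac{1}{t}\right) \ge -(p-2)\log t,$$

where $t$ denotes $r_+/r_-$. The last inequality can be established if we require that

(2.19)   $$\inf_{\substack{\alpha(r_-) < s < \alpha(r_+) \\ |\theta| > c,\ 0 < \beta < \beta_0}} g'(s) \ge -\frac{p-2}{a} \inf_{\substack{|\theta| > c \\ 0 < \beta < \beta_0}} \frac{\log t}{t - t^{-1}}.$$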

Differentiating the function $(\log t)/(t - 1/t)$, dropping the denominator, and
differentiating the numerator again show that

(2.20)   $$\frac{\log t}{t - 1/t} \ \text{is decreasing in } t,\ t > 1.$$
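The two differentiation steps behind (2.20) can be written out as follows. This is a sketch of the calculation the text describes; the helper name $n(t)$ for the numerator is introduced here for illustration and is not from the paper.

```latex
\begin{align*}
\frac{d}{dt}\,\frac{\log t}{t - t^{-1}}
  &= \frac{n(t)}{(t - t^{-1})^{2}},
  \qquad n(t) = 1 - t^{-2} - (1 + t^{-2})\log t, \\
n(1) &= 0, \qquad
n'(t) = \frac{1 - t^{2} + 2\log t}{t^{3}} < 0 \quad (t > 1),
\end{align*}
% because \log t^{2} < t^{2} - 1 for t > 1; hence n(t) < 0 on t > 1 and the ratio decreases.
```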
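Hence the infimum on the right-hand side of (2.19) is attained at $t = t^*$, where

$$t^* \stackrel{\mathrm{def}}{=} \sup_{\substack{|\theta| > c \\ 0 < \beta < \beta_0}} t.$$

It can be shown as on page 9 of Hwang and Casella (1984) that $t$ is decreasing in
$\beta$ and $|\theta|$, and consequently

(2.21)   $$t^* = t\,\Big|_{|\theta| = c,\ \beta = 0} = \frac{c + \sqrt{c^2 + a}}{\sqrt{a}}$$

and

$$t^* - \frac{1}{t^*} = \frac{2c}{\sqrt{a}}.$$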
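The identities (2.18) and (2.21) are easy to confirm numerically. The following is a small sketch under the definitions (2.9) and (2.10); the function names are hypothetical, and the particular values of $a$, $|\theta|$, $\beta$ and $c$ are arbitrary test inputs.

```python
import math


def r_pm(a, theta_norm, beta, c):
    """Return (r_minus, r_plus) from (2.9)-(2.10); needs theta_norm >= c."""
    root = math.sqrt(c * c - (theta_norm * math.sin(beta)) ** 2)
    values = []
    for sign in (-1.0, 1.0):
        r0 = theta_norm * math.cos(beta) + sign * root             # r_{0 -/+}
        values.append((r0 + math.sqrt(r0 * r0 + 4.0 * a)) / 2.0)   # (2.10)
    return values[0], values[1]


def alpha(r, theta_norm, beta):
    """alpha(r, beta) = r^2 - 2 r |theta| cos(beta) + |theta|^2."""
    return r * r - 2.0 * r * theta_norm * math.cos(beta) + theta_norm ** 2


def check_2_18(a, theta_norm, beta, c):
    """Return both sides of (2.18) together with t = r_plus / r_minus."""
    r_minus, r_plus = r_pm(a, theta_norm, beta, c)
    t = r_plus / r_minus
    lhs = alpha(r_plus, theta_norm, beta) - alpha(r_minus, theta_norm, beta)
    rhs = a * (t - 1.0 / t)
    return lhs, rhs, t


def t_star(a, c):
    """Closed form (2.21)."""
    return (c + math.sqrt(c * c + a)) / math.sqrt(a)
```

Evaluating `check_2_18` at $|\theta| = c$, $\beta = 0$ should return $t$ equal to `t_star(a, c)`, and any other admissible point should give a smaller $t$.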


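Therefore (2.19) is equivalent to

$$\inf_{\substack{\alpha(r_-) < s < \alpha(r_+) \\ |\theta| > c,\ 0 < \beta < \beta_0}} g'(s) \ge -\frac{(p-2)\log t^*}{2c\sqrt{a}}.$$

Comparing this inequality with (2.5) and noting $g'(s) = f'(s)/f(s)$, we would
have established Theorem 2.2 if we could show

(2.22)   $$\alpha_0 = \inf_{\substack{|\theta| > c \\ 0 < \beta < \beta_0}} \alpha(r_-)$$

and

(2.23)   $$\alpha_1 = \sup_{\substack{|\theta| > c \\ 0 < \beta < \beta_0}} \alpha(r_+),$$

where $\alpha_0$ and $\alpha_1$ were given in the statement of Theorem 2.2. These two
equations are established in Lemmas A.3 and A.4 of Hwang and Chen (1983) and
Theorem 2.2 is proved. □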


For a multivariate t distribution and a double exponential distribution,
Theorem 2.2 can be strengthened; this is the goal of the next two
sections.


3. Refinement of the domination results for the multivariate t distribution.
In many distributions, including normal distributions, t distributions, and
the exponential distributions, the RIR $f'(s)/f(s)$ is an increasing function. (In
other words, $\ln f$ is convex or $f$ is log convex.) In this section we take advantage
of this fact and derive some stronger theorems. Even though we mainly focus
on t distributions, the ideas can be grasped more easily if the general results are
presented first.


THEOREM 3.1. Assume that $X \sim f(|x - \theta|^2)$ and $f$ is log convex. Let $\alpha_0$ and
$t^*$ be as in (2.4) and (2.21). Then $C_{\delta^a}$ dominates $C_X$ provided that

(3.1)   $$\inf_{1 < t \le t^*} h(t) \ge -(p-2),$$

where

$$h(t) = \frac{g(a(t - t^{-1}) + \alpha_0) - g(\alpha_0)}{\ln t}$$

and $g(s) = \ln f(s)$.


PROOF. Applying (2.18) and using the notation $t = r_+/r_-$ show that (2.16) is
equivalent to

(3.2)   $$\frac{g(\alpha(r_-) + a(t - t^{-1})) - g(\alpha(r_-))}{\ln t} \ge -(p-2).$$

Since $\alpha(r_-) \ge \alpha_0$ and $g$ is convex, the left-hand side is greater than or equal to

$$\frac{g(a(t - t^{-1}) + \alpha_0) - g(\alpha_0)}{\ln t}.$$

This together with (3.1) implies the theorem. □


Minimizing $h(t)$ is usually quite difficult. However, with the help of the
following Lemma 3.2, it can be solved for t distributions and double exponential
distributions. An empty set and a single point set are called degenerate intervals.


LEMMA 3.2. Assume that $g$ is differentiable. Suppose the set of $t \ge 1$ such
that

(3.3)   $$\frac{p-2}{t} + a\,g'(a(t - t^{-1}) + \alpha_0)(1 + t^{-2}) \ge 0$$

is an interval (possibly degenerate or possibly with infinite length). Then (3.1) is
equivalent to (i) $h(1^+) \ge -(p-2)$ and (ii) $h(t^*) \ge -(p-2)$.


PROOF. Clearly (i) and (ii) are necessary for (3.1). To prove that (i) and (ii)
are sufficient, note that (3.1) is equivalent to

(3.4)   $$(p-2)\ln t + g(a(t - t^{-1}) + \alpha_0) - g(\alpha_0) \ge 0, \quad \forall t,\ 1 \le t \le t^*.$$

The derivative of the left-hand side of (3.4) is exactly the left-hand side of (3.3),
which by assumption has an interval solution. If this interval is degenerate, then
(i) and (ii) clearly imply (3.4) and this lemma is proved. If this interval is not
degenerate, let $\lambda_1, \lambda_2$, $1 \le \lambda_1 < \lambda_2 \le \infty$, be the endpoints. Below, we show that
$\lambda_1 = 1$ and hence the left-hand side of (3.4) is increasing for $t \in [1, \lambda_2]$ and is
decreasing for $t > \lambda_2$. This implies that (3.4) holds for $t$, $1 \le t \le t^*$, since
(3.4) holds for $t = 1$ by trivial observation and also holds for $t = t^*$ by condition
(ii). This establishes the lemma.

To show $\lambda_1 = 1$, all we need to do is to show that (3.3) is satisfied for $t = 1$.
Now condition (i) and l'Hospital's rule imply

$$-(p-2) \le h(1^+) = a\,t\,g'(a(t - t^{-1}) + \alpha_0)(1 + t^{-2})\,\Big|_{t=1},$$

which is equivalent to (3.3) for $t = 1$. □
Theorem 3.1 and Lemma 3.2 can be applied to the t distributions and the
double exponential distributions and yield domination results stronger than
Theorem 2.2. Here, we concentrate on t distributions, since the results for the
double exponential distributions can be further improved in the next section.


COROLLARY 3.3 (Multivariate t distribution). For $p > 2$, $C_{\delta^a}$ uniformly
dominates $C_X$ provided $0 < a \le a_t$, where $a_t = \min(a_1, a_2)$,


= mi (2 =) + 0 NFO iy 2 
a, = min} eè, E= c c Pa + 2) 

















l = 5 
Iaa” )se’, 
ay PG a otherwise, 
and a, is the unique solution to 
e+ ve? + a \MPm NP 2cVa , 
ya N + [2cVa + [(e- va)" ]’} 





PROOF. For multivariate t distributions, (3.3) is equivalent to

(3.5)   $$\frac{p-2}{t} - \frac{a(N+p)(1 + t^{-2})}{2\,[N + a(t - t^{-1}) + \alpha_0]} \ge 0$$

or

$$u(t) \stackrel{\mathrm{def}}{=} \left[t + \frac{1}{t}\right] - m\left[t - \frac{1}{t}\right] \le \frac{m}{a}(N + \alpha_0),$$

where $m = 2(p-2)/(N+p)$. Clearly the derivative of $u(t)$ is increasing in $t$ and
hence $u(t)$ is convex. Therefore the solutions to (3.5) form an interval. Applying
Theorem 3.1 and Lemma 3.2, and solving conditions (i) and (ii), yield this
corollary. □


Numerical values of $a_t$ for the multivariate t distribution are reported in
Table 1. Note in Table 1 that the $a_t$ are, in many cases, comparable with $a_N$,
whose values were given in the same table under $N = \infty$. Hence the domination
results for the normal case as established in Hwang and Casella (1984) hold for
many multivariate t distributions by Corollary 3.3. Numerical studies in Hwang
(1983) and the asymptotic results in Section 5 show that under multivariate t
distributions $C_{\delta^a}$ usually dominates $C_X$ even when $a = p - 2$.


4. Refinements of the domination results for the double exponential
distribution. Even though the technique developed in previous sections applies
also to the double exponential distribution and yields larger $a$'s than
Theorem 2.2, we can do even better by another approach to be described here. In
establishing the domination results, as before, all we have to do is to establish a
sufficient condition for (2.16) or equivalently

(4.1)   $$L(|\theta|, \beta) \stackrel{\mathrm{def}}{=} \frac{g(\alpha(r_+)) - g(\alpha(r_-))}{\log(r_+/r_-)} \ge -(p-2).$$

The difficulty in deriving a sufficient condition for (4.1) is that $L$ depends on $|\theta|$
and $\beta$ in a fairly complicated manner and consequently the minimum over all $|\theta|$
and $\beta$ is hard to find. Under the condition of the following lemma, we were able
to show that $L$ is minimized at $\beta = 0$ and hence the remaining minimization
problem involves only $|\theta|$, which is considerably simpler.


LEMMA 4.1. Assume that $g(t)$ is convex and decreasing and $g'$ is concave.
Then for every $|\theta|$, $L(|\theta|, \beta)$ is increasing in $\beta$ and consequently for every $\theta$,
$L(|\theta|, \beta)$ is minimized at $\beta = 0$.


PROOF. Write

(4.2)   $$L(|\theta|, \beta) = \frac{\alpha(r_+) - \alpha(r_-)}{\log t} \cdot \frac{g(\alpha(r_+)) - g(\alpha(r_-))}{\alpha(r_+) - \alpha(r_-)} \stackrel{\mathrm{def}}{=} R_1 R_2.$$

From (2.18), $R_1 = a(t - t^{-1})/\log t$ and by (2.20), $R_1$ increases as $t$ increases.
Since $t$ decreases as $\beta$ increases, so does $R_1$. Now because $R_2$ is nonpositive, to
establish the lemma, it suffices to show that $-R_2$ is decreasing or $R_2$ is
increasing in $\beta$. This can be shown as in Lemma 3.5 of Hwang and Chen (1983).
The arguments are fairly technical and are omitted. □


The assumptions of Lemma 4.1 are satisfied for multivariate t distributions
and double exponential distributions. Now under the assumptions of Lemma 4.1,
a sufficient condition for domination is

$$L(|\theta|, 0) \ge -(p-2) \quad \text{for every } |\theta| > c,$$

which by (2.9) and (2.10) is equivalent to

(4.3)   $$\frac{g((r_+ - |\theta|)^2) - g((r_- - |\theta|)^2)}{\log(r_+/r_-)}\,\Bigg|_{\beta=0} \ge -(p-2),$$

where

$$r_\pm\big|_{\beta=0} = \tfrac{1}{2}\left[|\theta| \pm c + \sqrt{(|\theta| \pm c)^2 + 4a}\right].$$


If one could find the minimum of the left-hand side of (4.3), in general, one would 
have established domination results stronger than Theorem 2.2 and Corollary 
3.3. However, solving this minimization problem, in general, is very difficult. So 
far, we have only been successful for the double exponential distribution. The 
result is reported in the following theorem. 


THEOREM 4.2 (Double exponential distribution). For $p > 2$, $C_{\delta^a}$ uniformly
dominates $C_X$ provided $0 < a \le \min(c^2, a_3) \stackrel{\mathrm{def}}{=} a_D$, where $a_3$ is the unique
solution to

(4.4)   $$\left(\frac{c + \sqrt{c^2 + a}}{\sqrt{a}}\right)^{p-2} \exp\{-k(\sqrt{c^2 + a} - c + \sqrt{a})\} = 1.$$


PROOF. Note that $g(s) = -k\sqrt{s}$. To establish (4.3), we show that under the
condition $a \le c^2$, $L(|\theta|, 0)$ is minimized at $|\theta| = c$. Since the condition
$L(c, 0) \ge -(p-2)$ is equivalent to $0 < a \le a_3$, we will have established the theorem.
Now similar to (4.2), write

$$L(|\theta|, 0) = \frac{a(t - t^{-1})}{\log t} \cdot \frac{g(\alpha(r_+)) - g(\alpha(r_-))}{\alpha(r_+) - \alpha(r_-)}\,\Bigg|_{\beta=0}
= \frac{a(t - t^{-1})}{\log t} \cdot \frac{-k}{|r_+ - |\theta|| + |r_- - |\theta||}\,\Bigg|_{\beta=0}.$$

Note that $r_+ > |\theta|$ and, since $a \le c^2$, that $r_- \le |\theta|$. Therefore

$$\bigl(|r_+ - |\theta|| + |r_- - |\theta||\bigr)\big|_{\beta=0} = (r_+ - r_-)\big|_{\beta=0},$$

which equals

(4.5)   $$\phi(|\theta| + c) - \phi(|\theta| - c),$$

where $\phi(s) = [s + \sqrt{s^2 + 4a}]/2$. Since $\phi(s)$ is convex, (4.5) increases in $|\theta|$. The
function $-L(|\theta|, 0)$ thus decreases in $|\theta|$, since $a(t - t^{-1})/\log t$ increases in $t$ by
(2.20) and $t$ decreases in $|\theta|$. This implies that $L(|\theta|, 0)$ increases in $|\theta|$ and the
theorem follows. □
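The bound $a_D$ of Theorem 4.2 can be computed by solving (4.4) on a log scale. The following is a minimal sketch, assuming the form of (4.4) above and using plain bisection; the function names and the bracketing interval are hypothetical choices, and $c^2$ is supplied directly (for example from Table 2).

```python
import math


def log_lhs_4_4(a, p, c2, k):
    """Logarithm of the left-hand side of (4.4)."""
    c = math.sqrt(c2)
    t_star = (c + math.sqrt(c2 + a)) / math.sqrt(a)
    return (p - 2) * math.log(t_star) - k * (math.sqrt(c2 + a) - c + math.sqrt(a))


def a_3(p, c2, k, lo=1e-12, hi=1e6, iters=200):
    """Bisection for the root of (4.4).

    Assumes the log of the left-hand side is positive at `lo` and
    negative at `hi`, with a single sign change in between.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_lhs_4_4(mid, p, c2, k) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def a_D(p, c2, k):
    """Bound of Theorem 4.2: min(c^2, a_3)."""
    return min(c2, a_3(p, c2, k))
```

For instance, `a_D(3, 6.251, 1.0)` can be compared with the $k = 1$ entry for $p = 3$ in Table 2, and `a_D(8, 13.362, 1.0)` returns $c^2$ itself, in line with the footnote to that table.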


Note that Theorem 4.2 is very strong in that it specifies a very large upper
bound on $a$. This bound, $a_D$, is probably very close to the best bound that one
can establish using the technique of Hwang and Casella (1984). As in Table 2, for
$k = 1$, $a_D$ is larger than $p - 2$, the traditional choice of $a$ in the point
estimation problem. If one considers the $k$ so that the common variance is 1,
then $k = \sqrt{p+1}$. (In general the variance equals $(p+1)/k^2$.) For $k = \sqrt{p+1}$,
$a_D$ is also reported in Table 2 and is less than but close to $a_N$, as reported in the
same table. Hence most of the domination results for the normal case established
in Theorem 2.1 of Hwang and Casella (1984) stand under the double exponential
distribution with common variance 1.


TABLE 2

Values of bounds for domination. The $c^2$ are chosen so that
$P(\chi_p^2 \le c^2) = 0.9$.

                    N(0, I)     k = 1     k = sqrt(p+1)
  p      c^2        a_N         a_D       a_D
  3      6.251      0.580       1.448     0.643
  4      7.779      1.339       3.408     1.401
  5      9.236      2.132       5.650     2.158
  6     10.645      2.942       8.129     2.906
  7     12.017      3.760      10.804     3.646
  8     13.362      4.585       *         4.378
  9     14.684      5.413       *         5.103
 10     15.987      6.245       *         5.822
 11     17.275      7.079       *         6.537
 12     18.549      7.915       *         7.246
 13     19.812      8.754       *         7.952
 14     21.064      9.593       *         8.654
 15     22.307     10.434       *         9.353
 16     23.542     11.276       *        10.050
 17     24.769     12.119       *        10.748
 18     25.989     12.963       *        11.435
 19     27.204     13.808       *        12.124
 20     28.412     14.653       *        12.811
 21     29.615     15.500       *        13.496
 22     30.813     16.346       *        14.180
 23     32.007     17.194       *        14.862
 24     33.196     18.042       *        15.542
 25     34.382     18.890       *        16.211

* The value is the same as the value of $c^2$ in the first column and
the same row.

It is unfortunate that we failed to establish a theorem similar to Theorem 4.2
for the multivariate t distribution. The corresponding expression on the left-hand
side of (4.3) becomes very messy and we cannot find the minimum. If we were
able to show that the minimum occurs at $|\theta| = c$, we would have shown that
domination results hold for $a \le a_2$ rather than $a \le \min(a_1, a_2)$ as needed in
Corollary 3.3. This would establish a larger interval of $a$ for domination.
However, numerical study also shows that $|\theta| = c$ is not the minimum point of
the left-hand side of (4.3) unless there are further conditions on $a$ and $c^2$.


5. A necessary and nearly sufficient condition for the domination of $C_{\delta^a}$
over $C_X$. In the previous sections, we provided sufficient conditions for $C_{\delta^a}$ to
dominate $C_X$. In most cases, $a$ was less than $p - 2$, the traditional choice of $a$ in
the James-Stein point estimator. Here, we provide some evidence that the range
of $a$ is probably at least twice as large as what was given earlier.

In this section, we use an asymptotic formula (5.1) (as $|\theta| \to \infty$) to derive
necessary conditions (Corollary 5.2). We provide some evidence that these
necessary conditions are close to being sufficient. Theorem 5.1, generalizing
Theorem 3.1 of Hwang and Casella (1984) for the normal case, can be proved by using
Taylor expansions and some tricky applications of integration by parts. For the
details, see Hwang (1983).


THEOREM 5.1 {p z 2). Assume that f(t) exists and is continuous in t, 
0O<t<c*. If 











(i) f #YP)a¥|\< œ, 
lYjse 
(ii) | Je d¥PE YP) dy < 00, 
and 
(iii) lim ¢?f(t*) = lim ¢?f’(t?) = 0, 
t+0 t-+0 
then, as |@| > œ 
Qac”? 
(5.1) P(0 E€ Cy) =l-at Daz (P — 2) f(c?) + af(c?)| + o(1017?), 


where $1 - \alpha = P(\theta \in C_X)$ and $Q$ is as in (2.8).


The assumptions in Theorem 5.1 are not restrictive and are satisfied by
normal, double exponential, and multivariate t distributions. Theorem 5.1 can be
used to provide conditions necessary for the domination of $C_{\delta^a}$ over $C_X$.


COROLLARY 5.2. Under the assumptions of Theorem 5.1, necessary conditions
for the domination of $C_{\delta^a}$ over $C_X$ are $a > 0$ and

(5.2)   $$a f'(c^2) + (p-2) f(c^2) \ge 0.$$


PROOF. Obviously if $C_{\delta^a}$ dominates $C_X$ then $a > 0$. (Otherwise if $a < 0$, $C_{\delta^a}$
has smaller coverage probability than $C_X$ at $|\theta| = 0$ and if $a = 0$, $C_{\delta^a}$ and $C_X$ are
identical.) Inequality (5.2) follows directly from (5.1). □


If $f(c^2) > 0$, condition (5.2) is then equivalent to

(5.3)   $$\frac{f'(c^2)}{f(c^2)} \ge -\frac{p-2}{a}.$$

Note that the left-hand side is exactly the RIR. Therefore, (5.3) implies domination
for large $|\theta|$ if the tail of the underlying distribution is flat enough.

Due to the shrinkage nature of $\delta^a$, one expects that for small $|\theta|$, $\delta^a$ and $C_{\delta^a}$
perform better than $X$ and $C_X$, respectively. In fact, by Lemma 2.5, the coverage
probability of $C_{\delta^a}$ is higher than that of $C_X$ for $|\theta| < c$ as long as $a > 0$. Therefore, it is
moderate $|\theta|$ that are of concern.

For moderate $|\theta|$, one can expect the coverage probability to be reasonable.
Exact numerical computations of the coverage probabilities of $C_{\delta^a}$ performed in
Hwang (1983), using (2.11), show that this is the case. It turns out that if $a$
satisfies the necessary conditions in Corollary 5.2, then $C_{\delta^a}$ is close to dominating
$C_X$. In fact, with the addition of the condition

(5.4)   $$1 - \alpha \le P(\theta \in C_{\delta^a})\big|_{|\theta| = c} = \text{(2.11) with } |\theta| \text{ replaced by } c \text{ in the definition of } \beta_0 \text{ and } r_\pm,$$

these would become sufficient according to the numerical studies in Hwang
(1983), and, given (5.2), (5.4) is not much of a restriction.

Next we apply Corollary 5.2 to various distributions. The necessary condition
is $0 < a \le 2(p-2)$ for $N(\theta, I)$; $0 < a \le 2(p-2)c/k$ for the double exponential
distribution (2.4); and $0 < a \le 2(p-2)(N + c^2)/(N + p)$ for a multivariate
t distribution. If we have chosen $c$ according to a normal distribution
(i.e., $P(\chi_p^2 \le c^2) = 1 - \alpha$), $c^2$ is larger than $p$ (unless $1 - \alpha$ is smaller than 0.6)
and hence the upper bound for the multivariate t distribution is larger than
$2(p-2)$, i.e., the bound for the $N(\theta, I)$ distribution. If $k = 1$, $c$ is usually larger
than 1 (unless $1 - \alpha$ is less than 0.01); hence a similar conclusion holds for this
double exponential distribution. Even if $k = \sqrt{p+1}$ (so that the component
variance is 1), the upper bound for the double exponential distribution is usually larger
than $2(p-2)$ (unless $1 - \alpha$ is less than 0.75). Hence, again, the domination
results for the normal case usually hold for the multivariate t distribution and
the double exponential distribution with $k = 1$ or $k = \sqrt{p+1}$.
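The three bounds above all follow from (5.3) once the RIR $f'(s)/f(s)$ is evaluated at $s = c^2$. The sketch below makes that step explicit. The density forms are assumptions chosen to match the stated bounds ($f(s) \propto e^{-s/2}$, $f(s) \propto e^{-k\sqrt{s}}$, and $f(s) \propto (1 + s/N)^{-(N+p)/2}$), and the function names are hypothetical.

```python
import math


def necessary_upper_bound(p, rir_at_c2):
    """Largest a allowed by (5.3), i.e. RIR >= -(p - 2)/a, when the RIR is negative."""
    if rir_at_c2 >= 0.0:
        return math.inf  # (5.3) then holds for every a > 0
    return -(p - 2) / rir_at_c2


def rir_normal(s):
    """f(s) proportional to exp(-s/2): f'/f = -1/2."""
    return -0.5


def rir_double_exponential(s, k):
    """f(s) proportional to exp(-k sqrt(s)): f'/f = -k / (2 sqrt(s))."""
    return -k / (2.0 * math.sqrt(s))


def rir_t(s, n, p):
    """f(s) proportional to (1 + s/n)^(-(n+p)/2): f'/f = -(n+p) / (2 (n+s))."""
    return -(n + p) / (2.0 * (n + s))
```

For example, with $p = 5$ and $c^2 = 9.236$ the normal bound is $2(p-2) = 6$, while the t bound with $N = 10$ is larger because $c^2 > p$.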


Acknowledgments. The authors wish to thank Professors James O. Berger, 
Lawrence D. Brown, and George Casella, and the referees for their valuable 
comments which helped to substantially improve this paper over the previous 
draft. The authors thank Professor Peter Hooper for pointing out some refer- 
ences. The help of Jing-Huei Chen in many calculations is also highly appreci- 
ated. 


REFERENCES 


BERGER, J. O. (1975). Minimax estimation of location vectors for a wide class of densities. Ann.
Statist. 3 1318-1328.

BOX, G. E. P. (1952). Multi-factor designs of first order. Biometrika 39 49-57.

BOX, G. E. P. (1953). Spherical distributions (Abstract). Ann. Math. Statist. 24 687-688.

BRANDWEIN, A. C. (1979). Minimax estimation of the mean of spherically symmetric distributions
under general quadratic loss. J. Multivariate Anal. 9 579-588.

BRANDWEIN, A. C. and STRAWDERMAN, W. E. (1978). Minimax estimation of location parameters
for spherically symmetric unimodal distributions under quadratic loss. Ann. Statist. 6
377-416.

BRANDWEIN, A. C. and STRAWDERMAN, W. E. (1980). Minimax estimators of location parameters for
spherically symmetric distributions with concave loss. Ann. Statist. 8 279-284.

BROWN, L. D. (1966). On the admissibility of invariant estimators of one or more location
parameters. Ann. Math. Statist. 37 1087-1136.

CASELLA, G. and HWANG, J. T. (1982). Employing vague prior information in the construction of
confidence sets. Technical Report, Biometrics Unit, Cornell University.

CASELLA, G. and HWANG, J. T. (1983). Empirical Bayes confidence sets for the mean of a
multivariate normal distribution. J. Amer. Statist. Assoc. 78 688-698.

CHMIELEWSKI, M. A. (1981). Spherically symmetric distributions: A review and bibliography.
Internat. Statist. Rev. 49 67-74.

FARRELL, R. H. (1964). Estimators of a location parameter in the absolutely continuous case. Ann.
Math. Statist. 35 949-998.

HOOPER, P. (1982). Invariant confidence sets with smallest expected measure. Ann. Statist. 10
1283-1294.

HWANG, J. T. (1983). Robust improved confidence sets. Technical Report, Cornell Statistical Center.

HWANG, J. T. and CASELLA, G. (1982). Minimax confidence sets for the mean of a multivariate
normal distribution. Ann. Statist. 10 868-881.

HWANG, J. T. and CASELLA, G. (1984). Improved set estimators for a multivariate normal mean.
Statist. Decisions 1 3-16.

HWANG, J. T. and CHEN, J. (1983). How do the improved confidence sets for the mean of a
multivariate normal distribution perform when the true distribution is nonnormal?
Technical Report, Cornell Statistical Center (original version).

JAMES, W. and STEIN, C. (1961). Estimation with quadratic loss. Proc. Fourth Berkeley Symp.
Math. Statist. Prob. 1 361-379. Univ. California Press.

JOSHI, V. M. (1967). Inadmissibility of the usual confidence sets for the mean of a multivariate
normal population. Ann. Math. Statist. 38 1868-1875.

STRAWDERMAN, W. E. (1974). Minimax estimation of location parameters for certain spherically
symmetric distributions. J. Multivariate Anal. 4 255-264.

THOMAS, D. H. (1970). Some contributions to radial probability distributions, statistics and the
operational calculi. Ph.D. Dissertation, Wayne State University.

ZELLNER, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate
student t-error terms. J. Amer. Statist. Assoc. 71 400-405.


DEPARTMENT OF MATHEMATICS DEPARTMENT OF MATHEMATICAL SCIENCES 
CORNELL UNIVERSITY UNIVERSITY OF CINCINNATI 
ITHACA, NEW YORK 14853 CINCINNATI, OHIO 45221 


The Annals of Statistics
1986, Vol. 14, No. 2, 461-486


ROBUST BAYES AND EMPIRICAL BAYES ANALYSIS
WITH ε-CONTAMINATED PRIORS


BY JAMES BERGER¹ AND L. MARK BERLINER²
Purdue University and Ohio State University


For Bayesian analysis, an attractive method of modelling uncertainty in
the prior distribution is through use of ε-contamination classes, i.e., classes of
distributions which have the form $\pi = (1 - \varepsilon)\pi_0 + \varepsilon q$, $\pi_0$ being the base
elicited prior, $q$ being a “contamination,” and $\varepsilon$ reflecting the amount of error
in $\pi_0$ that is deemed possible. Classes of contaminations that are considered
include (i) all possible contaminations, (ii) all symmetric, unimodal
contaminations, and (iii) all contaminations such that $\pi$ is unimodal.

Two issues in robust Bayesian analysis are studied. The first is that of
determining the range of posterior probabilities of a set as $\pi$ ranges over the
ε-contamination class. The second, more extensively studied, issue is that of
selecting, in a data dependent fashion, a “good” prior distribution (the
Type-II maximum likelihood prior) from the ε-contamination class, and using
this prior in the subsequent analysis. Relationships and applications to
empirical Bayes analysis are also discussed.


1. Introduction. The most frequent criticism of subjective Bayesian analysis
is that it supposedly presumes an ability to completely and accurately
quantify subjective information in terms of a single prior distribution. However,
there has long existed [at least since Good (1950)] a robust Bayesian viewpoint
which assumes only that subjective information can be quantified in terms of a
class $\Gamma$ of possible distributions. The goal is then to make inferences or decisions
which are robust over $\Gamma$, i.e., which are relatively insensitive (or at least are
satisfactory) to deviations as the prior distribution varies over $\Gamma$. We will not
consider the philosophical or pragmatic reasons for adopting this viewpoint. Such
a discussion, along with a review of the area, may be found in Berger (1984). (We
also do not mean to imply that the single prior Bayesian approach is necessarily
bad; it usually works very well.) Related to this are various forms of empirical
Bayes analysis [cf. Morris (1983) for discussion and review], in which the prior
distribution is also assumed to belong to some class $\Gamma$ of distributions. Indeed,
Section 5 considers some familiar empirical Bayes problems from our perspective.
Also, see Berger and Berliner (1984).

Received November 1983; revised October 1985.

¹ Research supported by the National Science Foundation under Grants MCS-8101670A1 and
DMS-8401996.

² Research supported by the Office of Naval Research under Contract N00014-84-K-0422.

AMS 1980 subject classifications. Primary 62A15; secondary 62F15.

Key words and phrases. Robust Bayes, empirical Bayes, classes of priors, ε-contamination, type II
maximum likelihood, hierarchical priors.


Before discussing implementation of the robust Bayesian viewpoint, some
notation is helpful. Let $X$ denote the observable random variable (or vector),
which will (for simplicity) be assumed to have a density $f(x|\theta)$ (w.r.t. some
measure), where $\theta$ is an unknown parameter lying in a parameter space $\Theta$. A
prior distribution on $\theta$ will be denoted by $\pi$ (later, in examples, $\pi$ will be used to
denote either a prior or its corresponding density), and the resulting marginal
density of $X$ is given by


$$m(x|\pi) = E^\pi f(x|\theta) = \int_\Theta f(x|\theta)\,\pi(d\theta).$$


The posterior distribution of $\theta$ given $x$ (assuming it exists) will be denoted by
$\pi(\cdot|x)$ and, in nice situations, is defined by

$$\pi(d\theta|x) = f(x|\theta)\,\pi(d\theta)/m(x|\pi).$$


Finally, let $\mathcal{P}$ denote the space of all probability distributions on $\Theta$.
The class, $\Gamma$, of prior distributions to be considered in this paper is the
ε-contamination class; namely,

(1.1)   $$\Gamma = \{\pi: \pi = (1 - \varepsilon)\pi_0 + \varepsilon q,\ q \in \mathcal{Q}\},$$

where $0 < \varepsilon < 1$ is given, $\pi_0$ is a particular prior distribution, and $\mathcal{Q}$ is some
subset of $\mathcal{P}$. There are several reasons for consideration of this class. First, and
foremost, it is a sensible class to consider in light of the prior elicitation process.
The extensive and rapidly developing methodology on prior elicitation [cf.
Kadane et al. (1980)] makes specification of an initial believable prior, $\pi_0$, an
attractive starting point. However, in determining $\pi_0$ sensibly, one will make
probability judgements about subsets of $\Theta$, judgements which could be in error
by some amount $\varepsilon$. Stated another way, further reflection might lead to
alterations of probability judgements by an amount $\varepsilon$. Hence, possible priors involving
such alterations should be included in $\Gamma$.

Many classes of priors which have been considered are not sensible from the 
above viewpoint. For instance, classes of priors involving restrictions on moments 
force severe restrictions on the allowable prior tails. This makes little sense from 
the elicitation viewpoint, since the tails of a prior involve very small probabilities 
and are, therefore, nearly impossible to determine. Similarly, classes of conjugate 
priors are too limited, particularly in their inflexible tail behavior [cf. Berger 
(1985) ]. 

Two other major reasons for choosing $\Gamma$ as in (1.1) are (i) such $\Gamma$ are (as we
shall see) surprisingly easy to work with; and (ii) such $\Gamma$ are very flexible through
choice of $\mathcal{Q}$. In this paper we will restrict consideration to four interesting choices
of $\mathcal{Q}$. First, in Section 2 the choice $\mathcal{Q} = \mathcal{P}$ (all distributions) will be considered.
This choice is easy to work with and is, in some sense, conservative. In Section 3
we consider the class, $\mathcal{Q}$, of all contaminations which are symmetric and
unimodal. This is again very easy to work with. In Section 4 we consider the class of
all contaminations such that the resulting $\pi$ is unimodal (assuming that $\pi_0$ is
unimodal). It came as a great surprise to us that such a complicated class could
be worked with and provide reasonably simple answers. Finally, in Section 5 we
consider $\mathcal{Q}$ that are mixtures of various classes. The purpose of the section is to
show how easily mixed contaminations can be dealt with and also to apply the
methodology in some typical empirical Bayes situations.




Other articles that have used e-contamination classes of priors include 
Schneeweiss (1964), Blum and Rosenblatt (1967), Huber (1973), Marazzi (1985), 
Bickel (1984), and Berger (1982, 1984). Except for Huber (1973), these articles 
work within the frequentist Bayesian framework, whereas our approach will be 
almost entirely conditional Bayesian. Huber (1973) is discussed below and in 
Section 2.4. There is a substantial literature working with other types of classes 
of priors [cf. Leamer (1978) and DeRobertis and Hartigan (1981)], and with the 
very related idea of “upper” and “lower” probabilities. Berger (1984) contains 
considerable review and discussion of this literature. We strongly prefer the class 
in (1.1) for intuitive content and ease of analysis. 

The ideal analysis, to a robust Bayesian, is one in which it can be shown that
the inference or decision to be made is essentially the same for any prior in $\Gamma$.
[Indeed, it can be argued, see Berger (1984, 1985), that this is the only way in
which a statistical conclusion can claim to be ultimately sound.] What is needed,
to provide such conclusions, is essentially the ability to find minimums and
maximums of criterion functions as $\pi$ ranges over $\Gamma$. We illustrate this approach
in Section 2.4, where, for $\mathcal{Q}$ = {all distributions}, the range of posterior
probabilities of a (fixed) set $C$ is given [essentially following Huber (1973)]. This allows
finding the range of posterior probabilities of confidence sets and the range of
posterior probabilities of hypotheses, for such $\Gamma$.

Unfortunately, there are certain inadequacies in assuming that $\mathcal{Q}$ = {all
distributions} (see Section 2.3), and attempting the above program with more
reasonable $\Gamma$ (such as that in Section 4) becomes more difficult. A number of alternative
approaches to the problem of dealing with classes of priors have thus been
proposed, essentially leading to the choice of a single “robust” prior, decision, or
inference. Berger (1984, 1985) discusses various of these methods, including the
appealing technique of putting a prior distribution on $\Gamma$. [Such a prior is called a
hyperprior or a Type II probability distribution by Good (1965, 1980) and a
second-stage prior in certain situations by Lindley and Smith (1972).] Of course,
this corresponds to using a certain single prior (the “average” over $\Gamma$), but one
would suspect that the resulting Bayes rule would be quite robust with respect to
$\Gamma$. The difficulty in doing this is mainly technical: it is essentially impossible to
put a reasonable prior on complicated $\Gamma$, such as those in Sections 2-4, and carry
out the Bayesian calculations. Note also that, ideally, most of the prior
information available will have been exhausted in constructing $\Gamma$. Hence, any prior
distribution placed on $\Gamma$ will, to a large extent, be arbitrary.

Instead, we will consider the simplest and most commonly used method of
selecting a hopefully robust prior in $\Gamma$, namely choice of that prior $\pi$ which
maximizes the marginal $m(x|\pi)$ over $\Gamma$. This process is called Type II maximum
likelihood by Good (1965). For $\pi = (1 - \varepsilon)\pi_0 + \varepsilon q$, $q \in \mathcal{Q}$, maximizing
$m(x|\pi) = (1 - \varepsilon)m(x|\pi_0) + \varepsilon m(x|q)$ over $\pi$ is clearly done by maximizing $m(x|q)$ over $q$.
Assuming that the maximum of $m(x|q)$ is attained at (a unique) $\hat{q} \in \mathcal{Q}$, we will
then suggest formally using the estimated prior $\hat{\pi}$, given by

(1.2)   $$\hat{\pi} = (1 - \varepsilon)\pi_0 + \varepsilon\hat{q}.$$

(Of course, $\hat{\pi}$ thus depends on $x$.) Throughout the paper, $\hat{\pi}$ will be called the
ML-II prior. Also, any quantities derived from $\hat{\pi}$ will appear with the modifier
“ML-II” for clarity.
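The reduction from maximizing $m(x|\pi)$ to maximizing $m(x|q)$ can be illustrated on a toy problem. The sketch below is an illustration only and makes several assumptions not in the text: $X|\theta \sim N(\theta, 1)$, the base prior and every candidate contamination are normal, and $\mathcal{Q}$ is a small finite list. All names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def marginal(x, prior):
    """m(x|q) when X|theta ~ N(theta, 1) and q = N(mu, tau^2): X ~ N(mu, 1 + tau^2)."""
    mu, tau = prior
    return normal_pdf(x, mu, math.sqrt(1.0 + tau * tau))


def ml2_prior(x, eps, base, candidates):
    """Pick q_hat maximizing m(x|q) over a finite candidate list, as in (1.2).

    Returns q_hat and the marginal of the ML-II prior (1 - eps) pi_0 + eps q_hat.
    """
    q_hat = max(candidates, key=lambda q: marginal(x, q))
    m_hat = (1.0 - eps) * marginal(x, base) + eps * marginal(x, q_hat)
    return q_hat, m_hat
```

Because the term $(1 - \varepsilon)m(x|\pi_0)$ is common to every member of $\Gamma$, the candidate with the largest $m(x|q)$ also gives the largest $m(x|\pi)$.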

Choosing a prior with the help of the data always engenders controversy.
Several justifications for doing so can be given, however. First, if $m(x|\pi)$ is
“small,” it is simply unlikely that such a $\pi$ could be “true,” and hence worrying
about such $\pi$ is counterproductive. Recall that (supposedly) all $\pi \in \Gamma$ are deemed
to be reasonable representations of prior beliefs, so $\hat{\pi}$ is simply the prior which is
most plausible, in light of prior opinions and the data. A more formal way of
saying this is that, if all $\pi \in \Gamma$ are roughly equally likely a priori, then $\hat{\pi}$ is the
“posterior mode” of the “uniform” distribution on $\Gamma$, and might often be
expected to yield a posterior distribution that is close to the true posterior
distribution for such a “uniform” distribution on $\Gamma$.

The preceding argument for $\hat{\pi}$ is, of course, nonrigorous, and the ultimate
justification for proceeding in this way is simply that it can give reasonable
answers. Of course, there is already substantial evidence in the literature
attesting to the success of the method, both in the Bayesian literature [cf. Jeffreys
(1961), Good (1965, 1980), Box and Tiao (1973), Bishop, Fienberg, and Holland
(1975), and Zellner (1985)], and in the empirical Bayesian literature [cf. Maritz
(1970) and Morris (1983)]. Indeed, note that the “standard” empirical Bayes
methodology is to choose $\Gamma$ to be a class of conjugate priors and then to estimate
the “hyperparameters” of the prior by maximizing $m(x|\pi)$, yielding $\hat{\pi}$. Also the
related use of the marginal in Bayesian model robustness investigations is well
established [cf. Box and Tiao (1973), Dempster (1975), and Box (1980)]. When all
is said and done, however, we recognize that the ML-II technique is not foolproof
and can produce bad answers, particularly when $\Gamma$ includes unreasonable
distributions. (The basic problem with the ML-II technique is that ensuing
calculations of variability do not take into account the “error” of the ML-II estimation;
see Section 2.3 for an extreme example of this problem.) In Section 6 we give a
general discussion of the success of the method for the situations discussed in the
paper.

We conclude this section with useful formulas and notation. For priors of the
form

(1.3)    $\pi(d\theta) = (1-\varepsilon)\pi_0(d\theta) + \varepsilon q(d\theta),$

computations give [assuming the existence of the posterior distributions
$\pi_0(d\theta|x)$ and $q(d\theta|x)$]

(1.4)    $m(x|\pi) = (1-\varepsilon)m(x|\pi_0) + \varepsilon m(x|q)$

and

(1.5)    $\pi(d\theta|x) = \lambda(x)\pi_0(d\theta|x) + (1-\lambda(x))q(d\theta|x),$

where $\lambda(x) \in [0,1]$ is given by

(1.6)    $\lambda(x) = (1-\varepsilon)m(x|\pi_0)/m(x|\pi).$

Furthermore, the posterior mean, $\delta^{\pi}$, and posterior variance, $V^{\pi}$, can be written
(assuming they exist) as

(1.7)    $\delta^{\pi}(x) = \lambda(x)\delta^{\pi_0}(x) + (1-\lambda(x))\delta^{q}(x)$

and

(1.8)    $V^{\pi}(x) = \lambda(x)V^{\pi_0}(x) + (1-\lambda(x))V^{q}(x) + \lambda(x)(1-\lambda(x))(\delta^{\pi_0}(x) - \delta^{q}(x))^2.$

Part of the appeal of the $\varepsilon$-contamination class, $\Gamma$, is the simplicity of these
formulas.
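The mixture formulas (1.6)-(1.8) translate directly into code. The following is a minimal sketch, not part of the paper: the function names `contamination_weight` and `mixture_moments` are hypothetical, and the marginals and component moments are assumed to be supplied by the caller.

```python
def contamination_weight(eps, m0, mq):
    """lambda(x) of (1.6): posterior weight attached to the elicited prior pi_0.

    eps is the contamination fraction; m0 = m(x|pi_0) and mq = m(x|q) are the
    marginal densities of the observed x under pi_0 and under q.
    """
    return (1.0 - eps) * m0 / ((1.0 - eps) * m0 + eps * mq)


def mixture_moments(lam, mean0, var0, mean_q, var_q):
    """Posterior mean (1.7) and variance (1.8) of the mixture posterior (1.5)."""
    mean = lam * mean0 + (1.0 - lam) * mean_q
    # The last term is the between-component spread of the two posterior means.
    var = (lam * var0 + (1.0 - lam) * var_q
           + lam * (1.0 - lam) * (mean0 - mean_q) ** 2)
    return mean, var
```

The cross term in the variance is what makes the mixture posterior more dispersed than either component when the two posterior means disagree.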


2. Analysis for arbitrary contaminations. A natural suggestion for a
class of contaminations of a fixed, elicited prior $\pi_0$ is the class of all possible
contaminations. In this section we will examine inferences, including point
estimation, testing, and credible regions, for such a class, i.e., for

(2.1)    $\Gamma = \{\pi: \pi = (1-\varepsilon)\pi_0 + \varepsilon q,\ q \in \mathcal{P}\}.$

In a number of respects this is too large a class of priors, including many priors
that are unreasonable. And it will be seen that this can lead to serious difficulties
in some situations (although for certain purposes no problems are encountered).
We give a fairly detailed analysis of this situation because its relative simplicity
allows easy comprehension of important concepts (including the difficulties of
using too large a $\Gamma$), and because some useful robustness results do emerge. All
proofs are easy and are omitted.


2.1. The ML-II prior and posterior. For $\Gamma$ defined as in (2.1), the ML-II prior
and corresponding posterior are as follows.

THEOREM 2.1. Assume $X$ has a density $f(x|\theta)$ w.r.t. some dominating
measure on the sample space of $X$. Assume that the usual maximum likelihood
estimator for $\theta$, say $\hat\theta(x)$, exists and is unique. For $\Gamma$ defined as in (2.1), the
ML-II prior is given by

(2.2)    $\hat\pi(\cdot) = (1-\varepsilon)\pi_0(\cdot) + \varepsilon\hat q(\cdot),$

where $\hat q$ assigns probability one to the point $\theta = \hat\theta(x)$. The ML-II posterior is
given by

(2.3)    $\hat\pi(\cdot|x) = \hat\lambda(x)\pi_0(\cdot|x) + (1-\hat\lambda(x))\hat q(\cdot),$

where

(2.4)    $\hat\lambda(x) = (1-\varepsilon)m(x|\pi_0)/[(1-\varepsilon)m(x|\pi_0) + \varepsilon f(x|\hat\theta(x))].$

2.2. The ML-II posterior mean. Under the assumptions of Theorem 2.1, the
ML-II posterior mean of $\theta$ is given by [see (1.7)]

(2.5)    $\delta^{\hat\pi}(x) = \hat\lambda(x)\delta^{\pi_0}(x) + (1-\hat\lambda(x))\hat\theta(x).$

As an estimator of $\theta$, $\delta^{\hat\pi}$ is intuitively appealing, in that it is a reasonable data
dependent mixture of $\delta^{\pi_0}$ and $\hat\theta$. When the data are consistent with $\pi_0$, $m(x|\pi_0)$
will be reasonably large and $\hat\lambda(x)$ close to one (for small $\varepsilon$), so that $\delta^{\hat\pi}$ will
essentially equal $\delta^{\pi_0}$. When the data and $\pi_0$ are not compatible, however,
$m(x|\pi_0)$ will be small and $\hat\lambda(x)$ near zero; $\delta^{\hat\pi}$ will then be approximately equal to
the m.l.e. $\hat\theta$.

The following example presents $\delta^{\hat\pi}$ in an important situation. Some properties
of the estimator are discussed which give a degree of “outside validation” to the
estimator.


EXAMPLE 1. Let $X = (X_1, \ldots, X_p)^t \sim \mathcal{N}_p(\theta, \sigma^2 I_p)$, where $\theta = (\theta_1, \ldots, \theta_p)^t$ is
unknown and $\sigma^2$ is known. Suppose the elicited prior, $\pi_0$, for $\theta$ is $\mathcal{N}_p(\mu, \tau^2 I_p)$.
(Thus $\mu$ and $\tau^2$ are specified.) Since the usual maximum likelihood estimator of $\theta$
is $\hat\theta(x) = x$, and $\delta^{\pi_0}(x) = x - [\sigma^2/(\sigma^2+\tau^2)](x-\mu)$, formula (2.5) reduces to

    $\delta^{\hat\pi}(x) = (1 - \hat\lambda(x)\sigma^2/(\sigma^2+\tau^2))(x-\mu) + \mu,$

where

    $\hat\lambda(x) = [1 + (\varepsilon/(1-\varepsilon))(1+\tau^2/\sigma^2)^{p/2}\exp\{|x-\mu|^2/2(\sigma^2+\tau^2)\}]^{-1}.$

Note that $\hat\lambda$ goes to 0 exponentially fast in $|x-\mu|^2$, so that $\delta^{\hat\pi}(x) \to x$ quite
rapidly as $|x-\mu|^2$ gets large. Because of this, one might conjecture that the
estimator is minimax, in a frequentist decision-theoretic sense under, say,
quadratic loss. Unfortunately, this turns out not to be the case, although the
deviation from minimaxity is usually fairly slight. It is also interesting to note
that $\delta^{\hat\pi}$ happens to coincide with the generalized Bayes estimator corresponding
to the formal prior

    $p(d\theta) = (1-\varepsilon)\pi_0(d\theta) + \varepsilon p_0(d\theta),$

where $p_0(d\theta) = (2\pi\sigma^2)^{-p/2}\,d\theta$. Note that priors of a similar form were considered
by, for instance, Leonard (1974). The development here can be viewed as
proposing a reasonable method for choosing the relative weights of $\pi_0$ and $d\theta$.
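As an illustration of Example 1, the following sketch evaluates $\hat\lambda(x)$ and $\delta^{\hat\pi}(x)$ for given inputs. It is a hedged illustration rather than the authors' code: the function name `ml2_normal_estimate` and the list-based interface are assumptions, and the weight is computed on the log scale only to avoid overflow for very discrepant data.

```python
import math


def ml2_normal_estimate(x, mu, sigma2, tau2, eps):
    """ML-II weight and posterior mean for Example 1 (arbitrary contaminations).

    x and mu are equal-length sequences (the p coordinates); sigma2 and tau2
    are the known sampling and prior variances; eps is the contamination
    fraction, assumed to lie strictly between 0 and 1.
    """
    p = len(x)
    dist2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    # log of (eps/(1-eps)) (1 + tau^2/sigma^2)^{p/2} exp{|x-mu|^2 / 2(sigma^2+tau^2)}
    log_ratio = (math.log(eps / (1.0 - eps))
                 + 0.5 * p * math.log(1.0 + tau2 / sigma2)
                 + dist2 / (2.0 * (sigma2 + tau2)))
    lam = 0.0 if log_ratio > 700.0 else 1.0 / (1.0 + math.exp(log_ratio))
    shrink = lam * sigma2 / (sigma2 + tau2)
    estimate = [mi + (1.0 - shrink) * (xi - mi) for xi, mi in zip(x, mu)]
    return lam, estimate
```

For data close to $\mu$ the estimate behaves like the conjugate Bayes rule with a reduced shrinkage factor; for data far from $\mu$ the weight collapses and the estimate is essentially $x$.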


2.3. The ML-II posterior variance. To determine the estimation error in
using $\delta^{\hat\pi}$, it is natural to look at the posterior variance, $V^{\hat\pi}$. From (1.8) it follows
that

    $V^{\hat\pi}(x) = \hat\lambda(x)[V^{\pi_0}(x) + (1-\hat\lambda(x))(\delta^{\pi_0}(x) - \hat\theta(x))^2].$

It will typically be the case (as in Example 1) that, as $\hat\lambda(x) \to 0$, $V^{\hat\pi}(x)$ will also
go to zero. Indeed, $\hat\pi(\cdot|x)$ will usually “converge” to a point mass at $\hat\theta(x)$. This is
clearly inappropriate; although data incompatible with $\pi_0$ can be cause for
preference of $\hat\theta(x)$ to $\delta^{\pi_0}(x)$, it does not cause one to think that $\theta$ equals $\hat\theta(x)$
exactly.

The trouble here is caused by the fact that $\Gamma$ contains unrealistic distributions.
We may feel that $\pi_0$ could be in error, but surely a point mass at $\hat\theta(x)$ (when far
from the center of $\pi_0$) is not usually a reasonable contamination to expect
a priori. Working with $\Gamma$ as in Sections 3 and 4, which do not allow such
implausible contaminations, substantially alleviates this problem. (See also
Section 6.)




2.4. Robustness as $\pi$ ranges over $\Gamma$. As mentioned in the introduction, the
ideal goal for a robustness study would be to show that a decision or inference
being contemplated is satisfactory for all $\pi \in \Gamma$. When $\mathcal{Q}$ is the class of all
distributions, it often becomes feasible to check this. The basic tool is the
following result of Huber (1973), concerning the range of posterior probabilities of
a set.

THEOREM 2.2 [Huber (1973)]. Suppose $\mathcal{Q} = \mathcal{P}$. Let $C$ be a measurable subset
of $\Theta$, and define $\beta_0$ to be the posterior probability of $C$ under $\pi_0$, i.e.,

    $\beta_0 = P^{\pi_0}(\theta \in C|X = x).$

Then

(2.6)    $\inf_{\pi\in\Gamma} P^{\pi}(\theta \in C|X = x) = \beta_0\left\{1 + \dfrac{\varepsilon\sup_{\theta\notin C} f(x|\theta)}{(1-\varepsilon)m(x|\pi_0)}\right\}^{-1}$

and

(2.7)    $\sup_{\pi\in\Gamma} P^{\pi}(\theta \in C|X = x) = \dfrac{(1-\varepsilon)m(x|\pi_0)\beta_0 + \varepsilon\sup_{\theta\in C} f(x|\theta)}{(1-\varepsilon)m(x|\pi_0) + \varepsilon\sup_{\theta\in C} f(x|\theta)}.$


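EXAMPLE 2. Assume that $X \sim \mathcal{N}(\theta, \sigma^2)$, $\sigma^2$ known, and that $\pi_0$ is $\mathcal{N}(\mu, \tau^2)$.
It is well known that $\pi_0(d\theta|x)$ is $\mathcal{N}(\delta(x), V^2)$, where

    $\delta(x) = x - (\sigma^2/(\sigma^2+\tau^2))(x-\mu), \qquad V^2 = \sigma^2\tau^2/(\sigma^2+\tau^2).$

The usual $100(1-\alpha)\%$ Bayes credible region for $\theta$ is

    $C = \{\theta: \delta(x) - K < \theta < \delta(x) + K\},$

where $K = z_{\alpha/2}V$, $z_{\alpha/2}$ being the $100(1-\alpha/2)$ upper percentile of the standard
normal distribution.
To investigate the robustness of $C$, we use (2.6) of Theorem 2.2. Note that

    $\sup_{\theta\notin C} f(x|\theta) = (2\pi\sigma^2)^{-1/2}$    if $x \notin C$,
    $\sup_{\theta\notin C} f(x|\theta) = (2\pi\sigma^2)^{-1/2}\exp\{-\frac{1}{2\sigma^2}(|x-\delta(x)| - K)^2\}$    if $x \in C$.

Thus (2.6) becomes, for $x \notin C$,

    $\inf_{\pi\in\Gamma} P^{\pi}(\theta \in C|X = x) = (1-\alpha)\left[1 + \dfrac{\varepsilon}{1-\varepsilon}\left(1+\dfrac{\tau^2}{\sigma^2}\right)^{1/2}\exp\left\{\dfrac{(x-\mu)^2}{2(\sigma^2+\tau^2)}\right\}\right]^{-1}$

and, for $x \in C$,

    $\inf_{\pi\in\Gamma} P^{\pi}(\theta \in C|X = x) = (1-\alpha)\left[1 + \dfrac{\varepsilon}{1-\varepsilon}\left(1+\dfrac{\tau^2}{\sigma^2}\right)^{1/2}\exp\left\{\dfrac{(x-\mu)^2 - (|x-\mu|\sigma/\sqrt{\sigma^2+\tau^2} - z_{\alpha/2}\tau)^2}{2(\sigma^2+\tau^2)}\right\}\right]^{-1}.$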



As a concrete example, suppose that $\sigma^2 = 1$, $\tau^2 = 2$, $\mu = 0$, and $\varepsilon = 0.2$. First,
suppose $x = 0.5$ is observed. Then the usual 95% Bayes credible interval for $\theta$ is
$(-1.27, 1.93)$. Calculation gives

    $\inf_{\pi\in\Gamma} P^{\pi}(-1.27 < \theta < 1.93|X = 0.5) = 0.817,$
    $\sup_{\pi\in\Gamma} P^{\pi}(-1.27 < \theta < 1.93|X = 0.5) = 0.966.$

Hence, the standard credible set is reasonably robust. On the other hand, suppose
$x = 4$ is observed. [Note that, since $m(x|\pi_0)$ is $\mathcal{N}(0, 3)$, this is not an
“outrageous” observation.] Then the usual 95% credible set is $(1.07, 4.27)$. However, in
this case we have that

    $\inf_{\pi\in\Gamma} P^{\pi}(1.07 < \theta < 4.27|X = 4) = 0.1355,$
    $\sup_{\pi\in\Gamma} P^{\pi}(1.07 < \theta < 4.27|X = 4) = 0.99.$

Since the posterior probability can get as low as 0.1355 for $x = 4$, robustness is
not present.
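The bounds of Theorem 2.2 for this normal example can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the normal c.d.f. is obtained from `math.erf`, and the percentile 1.96 is hard-coded as the default for a 95% set.

```python
import math


def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def credible_set_range(x, mu, sigma2, tau2, eps, z=1.96):
    """Range of the posterior probability of the usual credible interval C.

    Returns ((lo, hi), inf_p, sup_p), following (2.6) and the companion
    supremum formula of Theorem 2.2 for the class of all contaminations.
    """
    sigma = math.sqrt(sigma2)
    delta = x - sigma2 / (sigma2 + tau2) * (x - mu)
    half_width = z * math.sqrt(sigma2 * tau2 / (sigma2 + tau2))
    lo, hi = delta - half_width, delta + half_width
    beta0 = 2.0 * norm_cdf(z) - 1.0           # posterior mass of C under pi_0
    s = math.sqrt(sigma2 + tau2)
    m0 = norm_pdf((x - mu) / s) / s           # marginal m(x|pi_0)

    def lik(theta):
        return norm_pdf((x - theta) / sigma) / sigma

    nearest_edge = lo if abs(x - lo) <= abs(x - hi) else hi
    if lo < x < hi:
        sup_in, sup_out = lik(x), lik(nearest_edge)
    else:
        sup_in, sup_out = lik(nearest_edge), lik(x)
    inf_p = beta0 / (1.0 + eps * sup_out / ((1.0 - eps) * m0))
    sup_p = (((1.0 - eps) * m0 * beta0 + eps * sup_in)
             / ((1.0 - eps) * m0 + eps * sup_in))
    return (lo, hi), inf_p, sup_p
```

Calling `credible_set_range(4.0, 0.0, 1.0, 2.0, 0.2)` gives values that agree with those quoted in the text up to rounding; the infimum is attained by a point mass placed at the endpoint of $C$ nearest to $x$.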

Two interesting general points emerge from the previous example. First,
robustness with respect to $\Gamma$ will usually depend significantly on the $x$ observed.
Second, a lack of robustness may be due to the fact that $\Gamma$ is “too large.” When
$x = 4$, for instance, the low probability of coverage (0.1355) is achieved when the
contamination, $q$, is a point mass at 4.27. The resulting prior would probably not
have been deemed to be reasonable a priori. Using a more reasonable $\Gamma$ might
result in robustness. Also, more robust credible sets can be found; indeed Berger
and Berliner (1983) determine the optimal $1-\alpha$ robust credible set, optimal in
the sense of having smallest size (Lebesgue measure) subject to its posterior
probability being at least $1-\alpha$ for all $\pi$ in $\Gamma$. In any case, the use of $\mathcal{Q} = \mathcal{P}$ and
Theorem 2.2 is conservative, in that, if robustness of a credible set is achieved
for such $\Gamma$, one knows that robustness is also present for the more reasonable,
smaller $\Gamma$.

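Theorem 2.2 can also be used for hypothesis testing. Thus suppose we desire to
test the hypothesis $H_0$: $\theta \in \Theta_0$ versus the alternative $H_1$: $\theta \in \Theta - \Theta_0$. For a
fixed prior $\pi$, the usual Bayesian test is based on the posterior odds ratio $O_\pi(x)$,
defined by

    $O_\pi(x) = P^{\pi}(\theta \in \Theta_0|X = x)/[1 - P^{\pi}(\theta \in \Theta_0|X = x)].$

Letting $C = \Theta_0$, Theorem 2.2 immediately yields the following:

COROLLARY 2.1. For $\Gamma$ as in (2.1),

    $\inf_{\pi\in\Gamma} O_\pi(x) = O_{\pi_0}(x)\left[1 + \dfrac{\varepsilon\sup_{\theta\notin\Theta_0} f(x|\theta)}{(1-\varepsilon)(1-\beta_0)m(x|\pi_0)}\right]^{-1}$

and

    $\sup_{\pi\in\Gamma} O_\pi(x) = O_{\pi_0}(x)\left[1 + \dfrac{\varepsilon\sup_{\theta\in\Theta_0} f(x|\theta)}{(1-\varepsilon)\beta_0\,m(x|\pi_0)}\right],$

where $\beta_0 = P^{\pi_0}(\theta \in \Theta_0|X = x)$.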




In testing, it will usually be much easier to achieve robustness using this “too
large” $\Gamma$, since extreme $x$ [i.e., $x$ for which $m(x|\pi_0)$ is small], which lead to the
unrealistic point mass contaminations, will usually provide extreme evidence for,
or against, $\Theta_0$. (The difference between the inf and sup of $O_\pi$ may be substantial,
but they will both be substantially less than one or substantially greater than
one.) Together with the simplicity of the results in Corollary 2.1, this makes the
use of $\mathcal{Q} = \mathcal{P}$ very attractive for robustness investigations in testing.

It should be clear that Theorem 2.2 is also immediately applicable to the
testing of several hypotheses and to classification problems. Lower and upper
bounds on the posterior probabilities of all hypotheses can be obtained.
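Since the odds $P/(1-P)$ are increasing in $P$, the range of the posterior odds follows from the range of the posterior probability of $\Theta_0$ in Theorem 2.2. The sketch below takes that route; the function name `posterior_odds_range` and its argument names are hypothetical, and the caller is assumed to supply $\beta_0$ (strictly between 0 and 1), $m(x|\pi_0)$, and the two suprema of the likelihood.

```python
def posterior_odds_range(beta0, m0, sup_f_null, sup_f_alt, eps):
    """Infimum and supremum of the posterior odds of Theta_0 over Gamma.

    beta0      : posterior probability of Theta_0 under pi_0
    m0         : marginal density m(x|pi_0)
    sup_f_null : sup of f(x|theta) over theta in Theta_0
    sup_f_alt  : sup of f(x|theta) over theta outside Theta_0
    """
    base = (1.0 - eps) * m0
    # Extremes of P(Theta_0 | x) from Theorem 2.2 with C = Theta_0.
    p_inf = base * beta0 / (base + eps * sup_f_alt)
    p_sup = (base * beta0 + eps * sup_f_null) / (base + eps * sup_f_null)
    return p_inf / (1.0 - p_inf), p_sup / (1.0 - p_sup)
```

A decision is robust in the sense discussed above when both returned values lie on the same side of one.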


3. Symmetric unimodal contaminations. A natural, yet remarkably
tractable, class of priors to consider when $\Theta \subset R^1$ is the $\varepsilon$-contamination class
defined (for fixed $\theta_0$) by

(3.1)    $\mathcal{Q} = \{$densities of the form $q(|\theta - \theta_0|)$, $q$ nonincreasing$\}.$

This class is particularly reasonable when $\pi_0$ itself is unimodal and symmetric
about $\theta_0$. Note that under such circumstances, the resulting contaminated priors
$\pi$ display the desirable properties that (i) values of $\theta$ far from $\theta_0$ cannot be given
overwhelming weight (unlike the possibilities observed in Section 2), but (ii)
priors with tails larger than $\pi_0$ are considered.

The considerable simplicity of working with (3.1) accrues from the fact that in
much of the analysis, (3.1) can be replaced by

(3.2)    $\mathcal{Q}' = \{$Uniform$(\theta_0 - a, \theta_0 + a)$ densities, $a \ge 0\},$

where the “density” when $a = 0$ is a point mass at $\theta_0$. Required optimizations
thus involve only the variable $a$. Preliminary analyses of the type discussed in
Section 2.4 have been carried out in this manner, but are a bit involved and will
be reported elsewhere. However, the ML-II prior is quite simple to present:

THEOREM 3.1. For the $\varepsilon$-contamination class with $\mathcal{Q}$ as in (3.1), an ML-II
prior is

    $\hat\pi = (1-\varepsilon)\pi_0 + \varepsilon\hat q,$

where $\hat q$ is Uniform$(\theta_0 - \hat a, \theta_0 + \hat a)$, $\hat a$ being the value of $a$ which maximizes

(3.3)    $m(x|a) = (2a)^{-1}\int_{\theta_0-a}^{\theta_0+a} f(x|\theta)\,d\theta$ for $a > 0$, and $m(x|a) = f(x|\theta_0)$ for $a = 0$.

PROOF. The proof follows trivially after noting that (i) any prior in (3.1) is a
mixture of priors in (3.2), and (ii) $m(x|\pi)$ is a linear functional of $\pi$. □


Theorem 3.1 is an adaptation of a result in Berger and Sellke (1984), who
utilized the fact that $m(x|\hat q)$ is an upper bound on $m(x|q)$, $q$ in $\mathcal{Q}$ given by (3.1),
to establish startling lower bounds on posterior probabilities of point null
hypotheses that are an order of magnitude larger than classical significance levels
or P-values. Here we utilize the theorem to calculate the ML-II posterior mean
and variance in an illustrative example:

EXAMPLE: ESTIMATING A NORMAL MEAN. Suppose $X \sim \mathcal{N}(\theta, \sigma^2)$, $\sigma^2$ known,
and $\pi_0$ is $\mathcal{N}(\theta_0, \tau^2)$, $\theta_0$ and $\tau^2$ given. Let $\mathcal{Q}$ be as in (3.1) and define
$z = (x - \theta_0)/\sigma$. Following Berger and Sellke (1984) [see also Berger (1985)], $\hat a$ of Theorem
3.1 is

(3.4)    $\hat a = 0$ if $|z| \le 1.65$, and $\hat a = a^*\sigma$ if $|z| > 1.65$,

where $a^*$ satisfies the equation

(3.5)    $a^* = |z| + \left[-2\log\left(\sqrt{2\pi}\left\{[\Phi(a^* - |z|) - \Phi(-(a^* + |z|))]/a^* - \phi(-(a^* + |z|))\right\}\right)\right]^{1/2},$

$\Phi$ and $\phi$ denoting the standard normal c.d.f. and density, respectively. Equation
(3.5) can be solved (usually very quickly) by standard fixed point iteration,
starting on the right-hand side with initial value $a^* = |z|$.


CASE 1: $\hat a > 0$. It can be shown that the posterior mean and variance
corresponding to the Uniform$(\theta_0 - \hat a, \theta_0 + \hat a)$ prior are, respectively,

    $\delta^{\hat q}(x) = x - (\sigma/a^*)\tanh(za^*)$

and

    $V^{\hat q}(x) = \sigma^2[z - a^{*-1}\tanh(za^*)][a^{*-1}\tanh(za^*)].$

Furthermore,

    $\hat\lambda(x) = \left[1 + \dfrac{0.5\,\varepsilon(1+\tau^2/\sigma^2)^{1/2}}{1-\varepsilon}(1 + \exp(-2za^*))\exp\{-0.5((\tau^2z^2/(\sigma^2+\tau^2)) + a^{*2} - 2za^*)\}\right]^{-1},$

so that the ML-II posterior mean and variance are [see (1.7) and (1.8)]

(3.6)    $\delta^{\hat\pi}(x) = \hat\lambda(x)(x - (\sigma^2/(\sigma^2+\tau^2))(x - \theta_0)) + (1-\hat\lambda(x))\delta^{\hat q}(x)$

and

(3.7)    $V^{\hat\pi}(x) = \hat\lambda(x)\tau^2\sigma^2/(\tau^2+\sigma^2) + (1-\hat\lambda(x))V^{\hat q}(x) + \hat\lambda(x)(1-\hat\lambda(x))[(\sigma/a^*)\tanh(za^*) - \sigma^2(x-\theta_0)/(\sigma^2+\tau^2)]^2.$

CASE 2: $\hat a = 0$. If $\hat a = 0$, then $\delta^{\hat q}(x) = \theta_0$ and $V^{\hat q}(x) = 0$. The formulas for $\hat\lambda$,
$\delta^{\hat\pi}$, and $V^{\hat\pi}$ are then easily obtained and omitted here. In this case, however,
though $x$ is close to $\theta_0$, it is probably undesirable to allow $\hat\pi$ to be more
concentrated at $\theta_0$ than $\pi_0$. The natural “fix-up” is to simply replace $(\delta^{\hat\pi}, V^{\hat\pi})$ by
$(\delta^{\pi_0}, V^{\pi_0})$. In fact we generally recommend this modification whenever $V^{\hat\pi} < V^{\pi_0}$.




TABLE 1
ML-II results for unimodal, symmetric contaminations

  $x$      $\hat a$    $\delta^{\pi_0}$    $\hat\lambda$             $\delta^{\hat\pi}$    $V^{\hat\pi}$
  3.00     4.13     2.000     0.661                2.257     0.796
  5.00     6.43     3.333     0.166                4.594     1.054
  7.00     8.61     4.667     4.7 × 10^{-3}        6.873     0.842
 10.00    11.78     6.667     1.3 × 10^{-6}        9.915     0.851


As a specific example, suppose $\sigma^2 = 1$, $\theta_0 = 0$, $\tau^2 = 2$, and $\varepsilon = 0.2$. Values of
ML-II quantities and $\delta^{\pi_0}(x) = 2x/3$ for various $x$ are given in Table 1. Further
numerical results are given in Section 6.
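The fixed point iteration for (3.5) and the Case 1 formulas can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' program: the function names are hypothetical, only Case 1 ($|z| > 1.65$) is handled, and $\hat\lambda$ is evaluated on the log scale using $|z|$ so that the expression is symmetric in the sign of $x - \theta_0$.

```python
import math


def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def solve_a_star(z, tol=1e-12, max_iter=500):
    """Fixed point iteration for (3.5), started at a* = |z|."""
    z = abs(z)
    a = z
    for _ in range(max_iter):
        inner = math.sqrt(2.0 * math.pi) * ((Phi(a - z) - Phi(-(a + z))) / a
                                            - phi(a + z))
        new_a = z + math.sqrt(-2.0 * math.log(inner))
        if abs(new_a - a) < tol:
            return new_a
        a = new_a
    return a


def ml2_symmetric_unimodal(x, theta0, sigma2, tau2, eps):
    """Return (a_hat, delta_pi0, lambda_hat, ML-II mean, ML-II variance)."""
    sigma = math.sqrt(sigma2)
    z = (x - theta0) / sigma
    if abs(z) <= 1.65:
        raise ValueError("Case 2 (a_hat = 0) is not covered by this sketch")
    a = solve_a_star(z)
    t = math.tanh(z * a)
    mean_q = x - (sigma / a) * t
    var_q = sigma2 * (abs(z) - abs(t) / a) * (abs(t) / a)
    mean0 = x - sigma2 / (sigma2 + tau2) * (x - theta0)
    var0 = sigma2 * tau2 / (sigma2 + tau2)
    # log of eps * m(x|q_hat) / ((1 - eps) * m(x|pi_0))
    log_ratio = (math.log(0.5 * eps / (1.0 - eps))
                 + 0.5 * math.log(1.0 + tau2 / sigma2)
                 + math.log1p(math.exp(-2.0 * abs(z) * a))
                 - 0.5 * (tau2 * z * z / (sigma2 + tau2) + a * a
                          - 2.0 * abs(z) * a))
    lam = 0.0 if log_ratio > 700.0 else 1.0 / (1.0 + math.exp(log_ratio))
    mean = lam * mean0 + (1.0 - lam) * mean_q
    var = lam * var0 + (1.0 - lam) * var_q + lam * (1.0 - lam) * (mean0 - mean_q) ** 2
    return a * sigma, mean0, lam, mean, var
```

With $\sigma^2 = 1$, $\theta_0 = 0$, $\tau^2 = 2$, $\varepsilon = 0.2$, the call `ml2_symmetric_unimodal(3.0, 0.0, 1.0, 2.0, 0.2)` can be compared against the first row of Table 1.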


4. Unimodality preserving contaminations. 


4.1. Introduction. Despite the simplicity of the analyses in Sections 2 and 3,
there are some objections. In Section 2 we saw that choosing $\mathcal{Q} = \mathcal{P} = \{$all
distributions$\}$ can cause serious problems, due to the unrealistic nature of some of
the resultant priors in $\Gamma$. On the other hand, the restriction to symmetric
unimodal contaminations in Section 3 could be criticized for not allowing certain
plausible contaminations, particularly when $\pi_0$ is not symmetric. When $\Theta \subset R^1$
and $\pi_0$ is unimodal, as will be assumed throughout this section, perhaps the most
appealing $\mathcal{Q}$ is that which contains those contaminations which preserve the
unimodality of $\pi = (1-\varepsilon)\pi_0 + \varepsilon q$ (note that $q$ need not be unimodal). Any such
$\pi$ would typically be plausible a priori, and virtually all $\pi$ deemed reasonable
a priori are in this class. Successfully working with such “minimally complete” $\Gamma$
is a very desirable goal in Bayesian robustness.

The main goals of this section are to indicate that such complex classes can,
surprisingly, be analyzed and to provide some basic mathematical techniques
likely to be encountered in such analyses. The presentation here is restricted to
the determination of the ML-II prior and the ML-II posterior mean and
variance, a detailed example concerning the mean of a normal distribution, plus
some discussion of results later in Section 6. We are currently investigating the
more fundamental question of determining ranges of posterior measures as $\pi$
varies over $\Gamma$.

The exact class $\Gamma$ that will be considered is (where $\theta_0$ denotes the mode of $\pi_0$,
which we assume to be unique)

    $\Gamma = \{\pi = (1-\varepsilon)\pi_0 + \varepsilon q$: $q \in \mathcal{Q}$, the set of all probability
    densities for which $\pi$ is unimodal with (not necessarily
    unique) mode $\theta_0$, and $\pi(\theta_0) \le (1+\varepsilon')\pi_0(\theta_0)\}.$

The final inequality in the definition of $\Gamma$ specifies the reasonable constraint that
$q$ not be allowed to concentrate too sharply near $\theta_0$. (Usually it would be
reasonable to select $\varepsilon' = \varepsilon$, but this is not necessary. Indeed the choice $\varepsilon' = 0$
might sometimes be desired, it having the attractive property of ensuring that
the ML-II posterior variance never drops much below that of $\pi_0$ because of
excessive prior concentration near $\theta_0$.) We will also assume that the likelihood
function $f(x|\theta)$ is unimodal (as a function of $\theta$, of course) with unique mode $\hat\theta$.
[Of course, $x$ is fixed, so $f(x|\theta)$ need only be unimodal for the observed $x$, not for
all $x$.] It will also be technically convenient to restrict consideration to $\pi_0$ and $f$
which are nonzero and are strictly monotonic on each side of the modes. More
general cases could be handled, but the results get messier. We also assume,
without loss of generality, that $\hat\theta \ge \theta_0$.


4.2. The form of the ML-II prior and posterior. The formal calculation, in
Sections 4.4 and 4.5, of the ML-II prior, $\hat\pi$, is complicated by the need to consider
several different cases. The result, however, is always of the quite simple form

(4.1)    $\hat\pi(\theta) = K$ for $\theta \in B$, and $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta)$ for $\theta \notin B$,

i.e., $\hat\pi$ is uniform over $B$ (which will be an interval about $\hat\theta$), and equals
$(1-\varepsilon)\pi_0(\theta)$ outside of $B$. (Note that $K$ is implicitly defined by the constraint
that $\hat\pi$ have mass one.) Thus the ML-II posterior will be

(4.2)    $\hat\pi(\theta|x) = \hat\lambda(x)\pi_0(\theta|x) + (1-\hat\lambda(x))\hat q(\theta|x),$

where [letting $I_B(\theta)$ denote the usual indicator function on $B$]

    $\hat q(\theta) = \varepsilon^{-1}[K - (1-\varepsilon)\pi_0(\theta)]I_B(\theta),$

(4.3)    $\hat\lambda(x) = (1-\varepsilon)m(x|\pi_0)/[(1-\varepsilon)m(x|\pi_0) + \varepsilon m(x|\hat q)],$

    $m(x|\hat q) = \varepsilon^{-1}\int_B [K - (1-\varepsilon)\pi_0(\theta)]f(x|\theta)\,d\theta.$

The interesting case (Case 1 in Section 4.5) is that in which $|\hat\theta - \theta_0|$ is
moderately large (i.e., where prior and likelihood are not in close agreement),
since it is only in this case that the choice of $\pi \in \Gamma$ will have a substantial effect.
As $|\hat\theta - \theta_0| \to \infty$, it can typically be shown that the uniform piece of $\hat\pi$ dominates,
in the sense that

    $\hat\pi(\theta|x) \to f(x|\theta)\Big/\int f(x|\theta)\,d\theta,$

which would be the posterior for the noninformative uniform prior. This kind of
behavior can be labelled “robust” from a number of viewpoints [cf. Berger
(1984)], and is certainly more pleasing than the limiting behavior of $\hat\pi(\theta|x)$ in
Section 2, which collapsed to a point at $\hat\theta$ in the limit. Discussion of the degree of
“robustness” which is attained by $\hat\pi$ is delayed until Section 6.


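4.3. The normal distribution. We present here an example of the overall
theory, using the formulas from Sections 4.4 and 4.5. The example considered is
that in which $X$ is $\mathcal{N}(\theta, 1)$ and $\pi_0$ is $\mathcal{N}(0, \tau^2)$. [The more general case where
$X \sim \mathcal{N}(\theta, \sigma^2)$ and $\pi_0$ is $\mathcal{N}(\mu, \tau^2)$, $\sigma^2$, $\tau^2$, and $\mu$ all known, can be reduced to
this case by a linear transformation.] Only Case 1 will be considered (which here
means that $|x|$ is larger than a certain constant depending on $\varepsilon$ and $\tau$) and we
assume that $x > 0$. Let $\Phi$ denote the standard normal c.d.f., and $\phi$ the standard
normal density function.

The set $B$ in (4.1) is the interval $[\theta^*, w(\theta^*)]$, the endpoints being implicitly
defined by the equations [noting $\theta^* < x < w(\theta^*)$]

(4.4)    $(1-\varepsilon)\left\{\tau^{-1}\phi(\theta^*/\tau)[w(\theta^*) - \theta^*] - [\Phi(w(\theta^*)/\tau) - \Phi(\theta^*/\tau)]\right\} = \varepsilon,$
         $\phi(w(\theta^*) - x)[w(\theta^*) - \theta^*] - [\Phi(w(\theta^*) - x) - \Phi(\theta^* - x)] = 0.$

These equations can be easily solved by iteration. Calculation then gives that the
ML-II posterior is

    $\hat\pi(\theta|x) = \hat\lambda(x)C_0 f(x|\theta)$ if $\theta \in B$, and $\hat\pi(\theta|x) = \hat\lambda(x)\pi_0(\theta|x)$ if $\theta \notin B$,

where

    $\pi_0(\theta|x)$ is $\mathcal{N}(\delta, V^2)$, $\quad \delta = \dfrac{\tau^2 x}{1+\tau^2}, \quad V^2 = \dfrac{\tau^2}{1+\tau^2}$

(this notation is more convenient here than the previous $\delta^{\pi_0}$, $V^{\pi_0}$),

    $\hat\lambda(x) = [1 - B_0 + C_0\phi(w(\theta^*) - x)(w(\theta^*) - \theta^*)]^{-1},$

    $C_0 = \dfrac{(1+\tau^2)^{1/2}\phi(\theta^*/\tau)}{\tau\,\phi(x/(1+\tau^2)^{1/2})}, \quad$ and $\quad B_0 = \Phi\left(\dfrac{w(\theta^*) - \delta}{V}\right) - \Phi\left(\dfrac{\theta^* - \delta}{V}\right).$

The ML-II posterior mean and variance are given, respectively, by

    $\delta^{\hat\pi} = \hat\lambda(x)\delta + [1-\hat\lambda(x)]\delta^{\hat q}$

and

    $V^{\hat\pi} = \hat\lambda(x)V^2 + [1-\hat\lambda(x)]V^{\hat q} + \hat\lambda(x)[1-\hat\lambda(x)](\delta - \delta^{\hat q})^2,$

where

    $\delta^{\hat q} = x + (C_0D_0 - E_0 + (x-\delta)B_0)\dfrac{\hat\lambda(x)}{1-\hat\lambda(x)},$

    $V^{\hat q} = (\hat\lambda(x)^{-1} - 1)^{-1}\{-C_0D_0(2\delta^{\hat q} - x - \theta^*) + (x - \delta^{\hat q})^2[\hat\lambda(x)^{-1} - 1 + B_0]$
    $\qquad - (\theta^* + \delta - 2\delta^{\hat q})E_0 - [V^2 + (\delta - \delta^{\hat q})^2]B_0$
    $\qquad + V[w(\theta^*) - \theta^*]\phi([w(\theta^*) - \delta]/V)\},$

    $D_0 = \phi(\theta^* - x) - \phi(w(\theta^*) - x),$

and

    $E_0 = V\{\phi([\theta^* - \delta]/V) - \phi([w(\theta^*) - \delta]/V)\}.$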




TABLE 2
ML-II quantities for various $x$

  $x$      $B$                $\delta$       $\hat\lambda$            $\delta^{\hat\pi}$    $V^{\hat\pi}$
  1.75    (0, 2.53)        1.167    0.609               1.425     0.599
  3.00    (1.39, 3.73)     2.000    0.375               2.581     0.616
  5.00    (2.25, 6.08)     3.333    0.052               4.735     0.666
  7.00    (2.64, 8.35)     4.667    1.6 × 10^{-3}       6.827     0.735
 10.00    (2.97, 11.61)    6.667    4.5 × 10^{-7}       9.880     0.797


As a specific example, suppose $\tau^2 = 2$ and $\varepsilon = 0.2$. Then one can show [using
(4.6)] that Case 1 occurs providing $|x| \ge 1.75$. Table 2 gives the relevant
quantities above for various $x$.
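The endpoints of $B$ and the ML-II quantities of Section 4.3 can be computed numerically. The sketch below is one possible approach and is not taken from the paper: it solves the two equations in (4.4) by nested bisection (for each trial $\theta^*$ the prior mass equation determines $w(\theta^*)$, and $\theta^*$ is then chosen so that the likelihood equation holds). All function names are hypothetical, and the code assumes $X \sim \mathcal{N}(\theta, 1)$, $\pi_0 = \mathcal{N}(0, \tau^2)$, $x > 0$, and Case 1.

```python
import math


def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def bisect(fn, lo, hi, iters=100):
    """Root of fn on [lo, hi]; fn(lo) and fn(hi) should differ in sign."""
    f_lo = fn(lo)
    if f_lo == 0.0:
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = fn(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def w_of(theta, tau, eps):
    """Solve the first equation of (4.4) for w given theta."""
    def g(w):
        return ((1.0 - eps) * (phi(theta / tau) / tau * (w - theta)
                               - (Phi(w / tau) - Phi(theta / tau))) - eps)
    hi = theta + 1.0
    while g(hi) < 0.0:                 # expand until the root is bracketed
        hi = theta + 2.0 * (hi - theta)
    return bisect(g, theta, hi)


def W(theta, x, tau, eps):
    """Left-hand side of the second equation of (4.4)."""
    w = w_of(theta, tau, eps)
    return phi(w - x) * (w - theta) - (Phi(w - x) - Phi(theta - x))


def ml2_unimodality_preserving(x, tau2, eps):
    """Return ((theta*, w), lambda_hat, ML-II mean, ML-II variance) in Case 1."""
    tau = math.sqrt(tau2)
    if W(0.0, x, tau, eps) < 0.0:
        raise ValueError("Case 1 does not apply for this x")
    ts = bisect(lambda t: W(t, x, tau, eps), 0.0, x)
    w = w_of(ts, tau, eps)
    delta, V = tau2 * x / (1.0 + tau2), math.sqrt(tau2 / (1.0 + tau2))
    B0 = Phi((w - delta) / V) - Phi((ts - delta) / V)
    C0 = (math.sqrt(1.0 + tau2) * phi(ts / tau)
          / (tau * phi(x / math.sqrt(1.0 + tau2))))
    lam = 1.0 / (1.0 - B0 + C0 * phi(w - x) * (w - ts))
    D0 = phi(ts - x) - phi(w - x)
    E0 = V * (phi((ts - delta) / V) - phi((w - delta) / V))
    r = lam / (1.0 - lam)
    mq = x + r * (C0 * D0 - E0 + (x - delta) * B0)
    vq = r * (-C0 * D0 * (2.0 * mq - x - ts)
              + (x - mq) ** 2 * (1.0 / lam - 1.0 + B0)
              - (ts + delta - 2.0 * mq) * E0
              - (V * V + (delta - mq) ** 2) * B0
              + V * (w - ts) * phi((w - delta) / V))
    mean = lam * delta + (1.0 - lam) * mq
    var = lam * V * V + (1.0 - lam) * vq + lam * (1.0 - lam) * (delta - mq) ** 2
    return (ts, w), lam, mean, var
```

With $\tau^2 = 2$ and $\varepsilon = 0.2$, `ml2_unimodality_preserving(3.0, 2.0, 0.2)` can be compared with the $x = 3$ row of Table 2. Bisection is used here only because it needs no derivatives; any root finder would serve.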

The behavior alluded to earlier clearly obtains: as $|x|$ gets large, $\hat\lambda \to 0$, and
the uniform part of the prior (on $B$) dominates. Also, $\delta^{\hat\pi}(x) \to x$ and $V^{\hat\pi}(x) \to 1$.
Indeed, the following theorem gives large $x$ approximations to the key quantities,
approximations which are accurate, in the above example, for $x \ge 7$ and which
show that the domination of the uniform portion occurs at an exponential rate.
(The proof of the theorem is routine and will be omitted.)

THEOREM 4.1. As $x \to \infty$ (log denotes natural logarithm),

    $\theta^* = [2\tau^2\log x]^{1/2} + o(1), \qquad w(\theta^*) = x + [2\log(x/\sqrt{2\pi})]^{1/2} + o(1),$
    $\hat\lambda(x) = (\varepsilon^{-1} - 1)x[(1+\tau^2)2\pi]^{-1/2}\exp\{-x^2/[2(1+\tau^2)]\}(1 + o(1)),$
    $\delta^{\hat\pi}(x) = x - x^{-1}(1 + o(1)), \quad$ and $\quad V^{\hat\pi}(x) = 1 - 2x^{-1}[2\log x]^{1/2}(1 + o(1)).$


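4.4. Preliminaries and notation for the general theory. For $-\varepsilon' \le \rho \le \varepsilon$,
define $v(\rho) \ge \theta_0$, implicitly, by

(4.5)    $\pi_0(\theta_0)(1-\rho)(v(\rho) - \theta_0) - (1-\varepsilon)\int_{\theta_0}^{v(\rho)}\pi_0(\theta)\,d\theta = \varepsilon,$

and define

(4.6)    $V(\rho) = f(x|v(\rho))(v(\rho) - \theta_0) - \int_{\theta_0}^{v(\rho)} f(x|\theta)\,d\theta.$

For $\theta_0 \le \theta$, define $w(\theta) \ge \theta$, implicitly, by

(4.7)    $(1-\varepsilon)\pi_0(\theta)(w(\theta) - \theta) - (1-\varepsilon)\int_{\theta}^{w(\theta)}\pi_0(\xi)\,d\xi = \varepsilon,$

and define

(4.8)    $W(\theta) = f(x|w(\theta))(w(\theta) - \theta) - \int_{\theta}^{w(\theta)} f(x|\xi)\,d\xi.$

LEMMA 4.1. (a) The quantities $v(\rho)$ and $w(\theta)$ are well defined, unique,
continuous, and strictly increasing for $-\varepsilon' \le \rho \le \varepsilon$ and $\theta \ge \theta_0$.

(b) If $v(\rho) > \hat\theta$, then $V(\rho)$ is decreasing in $\rho$. Furthermore, $V(\rho) = 0$ has at
most one solution.

(c) If $\theta_0 \le \theta < \hat\theta$ and $w(\theta) > \hat\theta$, then $W(\theta)$ is decreasing at $\theta$. Furthermore, if
$V(\varepsilon) \ge 0$, then $W(\theta) = 0$ has a unique solution $\theta_0 \le \theta^* < \hat\theta$.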


PROOF. (a) At $v = \theta_0$, the left-hand side of (4.5) is zero. As $v \to \infty$, the
left-hand side of (4.5) goes to $\infty$. Finally, since $\pi_0$ is decreasing for $\theta > \theta_0$, the
derivative, with respect to $v$, of the left-hand side of (4.5) is easily seen to be
strictly positive. A solution to (4.5) thus exists and is unique.

To show that $v(\rho)$ is strictly increasing, one can differentiate both sides of
(4.5) with respect to $\rho$ and solve for $v'(\rho)$ [i.e., $d/d\rho\ v(\rho)$], obtaining

    $v'(\rho) = \pi_0(\theta_0)(v(\rho) - \theta_0)/[\pi_0(\theta_0)(1-\rho) - (1-\varepsilon)\pi_0(v(\rho))].$

Since $v(\rho) > \theta_0$, $\rho \le \varepsilon$, and $\pi_0$ is decreasing for $\theta > \theta_0$, it is clear that $v'(\rho) > 0$.
The verification for $w(\theta)$ is very similar.

(b) Letting $f'(x|\theta) = d/d\theta\ f(x|\theta)$, calculation gives

    $\dfrac{d}{d\rho}V(\rho) = f'(x|v(\rho))v'(\rho)(v(\rho) - \theta_0).$

Since $f$ is decreasing for $\theta > \hat\theta$, the monotonicity result follows from part (a). If
$V(\rho) = 0$, the unimodality of $f$ ensures that $v(\rho) > \hat\theta$ [for otherwise the
right-hand side of (4.6) is positive]. The strict monotonicity of $V$ for such $\rho$ ensures
that any solution to $V(\rho) = 0$ must be unique.

(c) Letting $w'(\theta) = d/d\theta\ w(\theta)$, calculation gives

    $\dfrac{d}{d\theta}W(\theta) = f'(x|w(\theta))w'(\theta)(w(\theta) - \theta).$

The monotonicity of $f$ and part (a) show that this is negative. Using this, to show
that $W(\theta) = 0$ has a unique solution, it is only necessary to show that $W(\theta_0) \ge 0$
and $W(\hat\theta) < 0$. Since $v(\varepsilon) = w(\theta_0)$, it follows that $W(\theta_0) = V(\varepsilon) \ge 0$ (by
assumption). That $W(\hat\theta) < 0$ follows from (4.8) and an easy application of the mean
value theorem [since $f(x|\theta)$ decreases for $\theta > \hat\theta$]. □


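LEMMA 4.2. Suppose $V(\varepsilon) \ge 0$, and let $\theta_0 \le \theta^* < \hat\theta$ be the solution to
$W(\theta) = 0$. Then

(a) $f(x|\theta) \le f(x|w(\theta^*))$ for $\theta \notin [\theta^*, w(\theta^*)]$.
(b) For any nonincreasing integrable function $g$ such that $\int_{\theta^*}^{w(\theta^*)} g(\theta)\,d\theta = 0$, it
follows that

(4.9)    $\int_{\theta^*}^{w(\theta^*)} g(\theta)f(x|\theta)\,d\theta \le 0.$

PROOF. (a) Clearly $f(x|\theta^*) \le f(x|w(\theta^*))$, for otherwise the integrand in (4.8)
would be everywhere larger than $f(x|w(\theta^*))$ and $W(\theta^*)$ would be nonzero, a
contradiction. The unimodality of $f$ thus gives the result for $\theta \le \theta^*$. Now
$w(\theta^*) > \hat\theta$, for otherwise (4.8) could again be used to contradict $W(\theta^*) = 0$. The
unimodality of $f$ thus also gives the result for $\theta \ge w(\theta^*)$.

(b) Note first that it suffices to prove the result for differentiable $g$. Letting
$h(\theta) = -d/d\theta\ g(\theta)$ (note $h \ge 0$) and writing $g(\theta) = K - \int_{\theta^*}^{\theta} h(\xi)\,d\xi$, where

(4.10)    $K = \dfrac{1}{w(\theta^*) - \theta^*}\int_{\theta^*}^{w(\theta^*)}(w(\theta^*) - \xi)h(\xi)\,d\xi,$

we obtain from Fubini's theorem

(4.11)    $\int_{\theta^*}^{w(\theta^*)} g(\theta)f(x|\theta)\,d\theta = K\int_{\theta^*}^{w(\theta^*)} f(x|\theta)\,d\theta - \int_{\theta^*}^{w(\theta^*)} h(\xi)\int_{\xi}^{w(\theta^*)} f(x|\theta)\,d\theta\,d\xi.$

Next we show that, for $\theta^* \le \xi \le w(\theta^*)$,

(4.12)    $\psi(\xi) \equiv \int_{\xi}^{w(\theta^*)} f(x|\theta)\,d\theta \ge (w(\theta^*) - \xi)f(x|w(\theta^*)).$

For $\xi \ge \hat\theta$ this is a trivial consequence of the monotonicity of $f$. For $\theta^* \le \xi < \hat\theta$,
note that $\psi(\xi)$ is concave [$f(x|\xi)$ is increasing here] and that

(4.13)    $\psi(\theta^*) = \int_{\theta^*}^{w(\theta^*)} f(x|\theta)\,d\theta = (w(\theta^*) - \theta^*)f(x|w(\theta^*))$

[since $W(\theta^*) = 0$]. Hence, $\psi(\xi)$ must lie above the line $(w(\theta^*) - \xi)f(x|w(\theta^*))$,
establishing (4.12). Using (4.12) in (4.11) we get that

    $\int_{\theta^*}^{w(\theta^*)} g(\theta)f(x|\theta)\,d\theta \le K\int_{\theta^*}^{w(\theta^*)} f(x|\theta)\,d\theta - f(x|w(\theta^*))\int_{\theta^*}^{w(\theta^*)}(w(\theta^*) - \xi)h(\xi)\,d\xi,$

the right-hand side of which is zero by (4.10) and (4.13). □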


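4.5. The ML-II prior. Define $\hat\pi$ as follows:

CASE 1. If $V(\varepsilon) \ge 0$, and $\theta^* \in [\theta_0, \hat\theta]$ is the solution to $W(\theta) = 0$, let

(4.14)    $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta^*)$ for $\theta^* \le \theta \le w(\theta^*)$, and $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta)$ otherwise.

CASE 2. If $V(\varepsilon) < 0$ but $V(-\varepsilon') \ge 0$, find $\rho^* \in [-\varepsilon', \varepsilon]$ so that $V(\rho^*) = 0$,
and let

(4.15)    $\hat\pi(\theta) = (1-\rho^*)\pi_0(\theta_0)$ for $\theta_0 \le \theta \le v(\rho^*)$, and $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta)$ otherwise.

CASE 3. If $V(-\varepsilon') < 0$ and $f(x|\theta_0) \le f(x|v(-\varepsilon'))$, let $\hat\pi$ be as in Case 2 with
$\rho^* = -\varepsilon'$.

CASE 4. If $V(-\varepsilon') < 0$ and $f(x|\theta_0) > f(x|v(-\varepsilon'))$, let

(4.16)    $\hat\pi(\theta) = (1+\varepsilon')\pi_0(\theta_0)$ for $\theta' \le \theta \le \theta''$, and $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta)$ otherwise,

where $\theta'$ and $\theta''$ are the (unique) solutions to the equations

(4.17)    $f(x|\theta') = f(x|\theta''),$
          $(1+\varepsilon')\pi_0(\theta_0)(\theta'' - \theta') - (1-\varepsilon)\int_{\theta'}^{\theta''}\pi_0(\theta)\,d\theta = \varepsilon.$

Lemma 4.1 establishes that all quantities involved in the definition of $\hat\pi$ are
well defined and unique. (The existence and uniqueness of $\theta'$ and $\theta''$ in Case 4 is
easy to establish.) Observe that, in all cases, $\hat\pi$ has a very simple and easy to work
with form of being uniform in a certain interval, and otherwise being equal to
$(1-\varepsilon)\pi_0$. Case 1 corresponds to the situation where the elicited prior, $\pi_0$, and the
likelihood function, $f(x|\theta)$, are moderately separated, Case 2 to the situation
where they are fairly close, and Cases 3 and 4 to situations where they are very
close.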


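THEOREM 4.2. The $\hat\pi$ defined in (4.14)-(4.16) is the ML-II prior in $\Gamma$.

PROOF. We only present the argument for Case 1, the other cases being very
similar. The goal is to show that

(4.18)    $m(x|\pi) - m(x|\hat\pi) = \int [\pi(\theta) - \hat\pi(\theta)]f(x|\theta)\,d\theta \le 0$

for all $\pi \in \Gamma$. Letting $g(\theta) = \pi(\theta) - \hat\pi(\theta)$, note that

(i) $g(\theta) \ge 0$ for $\theta \notin [\theta^*, w(\theta^*)]$, since $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta)$ here and
$\pi(\theta) \ge (1-\varepsilon)\pi_0(\theta)$;
(ii) $g(\theta)$ is nonincreasing on $[\theta^*, w(\theta^*)]$, since $\hat\pi(\theta)$ is uniform on this interval
and so $\pi(\theta) = g(\theta) + \hat\pi(\theta)$ would have a secondary mode were $g(\theta)$
somewhere increasing;
(iii) $K \equiv \int_{[\theta^*, w(\theta^*)]} g(\theta)\,d\theta = -\int_{[\theta^*, w(\theta^*)]^c} g(\theta)\,d\theta.$

Lemma 4.2(a) and (i) show that

    $\int_{[\theta^*, w(\theta^*)]^c} g(\theta)f(x|\theta)\,d\theta \le f(x|w(\theta^*))(-K).$

Lemma 4.2(b) and (ii) imply that

    $\int_{\theta^*}^{w(\theta^*)}\left[g(\theta) - \dfrac{K}{w(\theta^*) - \theta^*}\right]f(x|\theta)\,d\theta \le 0.$

Thus

(4.19)    $\int g(\theta)f(x|\theta)\,d\theta \le f(x|w(\theta^*))(-K) + \dfrac{K}{w(\theta^*) - \theta^*}\int_{\theta^*}^{w(\theta^*)} f(x|\theta)\,d\theta.$

Since $W(\theta^*) = 0$, the right-hand side of (4.19) is zero, and (4.18) follows. □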




COMMENTS. 1. The key step in the proof of Theorem 4.2 is really Lemma
4.2(b), which shows that one cannot improve on a uniform $\hat\pi$ on $[\theta^*, w(\theta^*)]$.
2. The problem might be susceptible to attack through calculus of variations,
since one is trying to maximize an expression involving an integral of $\pi$ over a
class of $\pi$. The difficulty is that the $\pi \in \Gamma$ satisfy a large number of inequality
and differential inequality constraints. Calculus of variations with such side
constraints is quite difficult.


5. Hierarchical classes of priors. 


5.1. Introduction. Hierarchical priors are typically employed when $\theta$ is a
vector $(\theta_1, \theta_2, \ldots, \theta_p)$, and the $\theta_i$ are thought to be independent realizations from
a common prior distribution $g$. Typically $g$ is assumed to lie in some class
$\Gamma_1 = \{g_\omega: \omega \in \Omega\}$ of distributions, often the class of conjugate priors, and a
“second stage” prior $h_0$ is placed on this class, i.e., on $\omega$. Such a hierarchical prior
can, of course, be written as a single prior, namely

(5.1)    $\pi_0(\theta) = \int_\Omega \left[\prod_{i=1}^{p} g_\omega(\theta_i)\right]h_0(\omega)\,d\omega.$

(We restrict ourselves to densities in this section, for convenience, and also will
not consider hierarchical priors with more than two stages.) Development of and
references for this approach can be found in Good (1980), Lindley and Smith
(1972), Morris (1983), and Berger (1985).

There are three possible robustness concerns in working with (5.1). One could
question the assumptions (i) that the $\theta_i$ are i.i.d.; (ii) that the prior $g$ belongs to
$\Gamma_1$; and (iii) that $h_0$ is specified correctly. Each of these concerns deserves careful
consideration separately but in the following we will simply deal with
uncertainty in the second stage (i.e., $h_0$), or in both the first and second stages
together.

Simultaneous uncertainty in different stages or aspects of a prior can often be
expressed most simply by allowing more than one contamination in the
$\varepsilon$-contamination model. For instance, one could consider

(5.2)    $\Gamma = \{\pi: \pi = (1 - \varepsilon_1 - \varepsilon_2)\pi_0 + \varepsilon_1 q_1 + \varepsilon_2 q_2,\ q_1 \in \mathcal{Q}_1,\ q_2 \in \mathcal{Q}_2\},$

where $\mathcal{Q}_1$ and $\mathcal{Q}_2$ are appropriate possible classes of contaminations. Such an
extension of the $\varepsilon$-contamination model vastly increases its flexibility while
causing no real hardship in many applications, because the important formulas
(1.4), (1.7), and (1.8) become simply

(5.3)    $m(x|\pi) = (1 - \varepsilon_1 - \varepsilon_2)m(x|\pi_0) + \varepsilon_1 m(x|q_1) + \varepsilon_2 m(x|q_2),$

(5.4)    $\delta^{\pi}(x) = [1 - \lambda_1(x) - \lambda_2(x)]\delta^{\pi_0}(x) + \lambda_1(x)\delta^{q_1}(x) + \lambda_2(x)\delta^{q_2}(x),$

(5.5)    $V^{\pi}(x) = (1 - \lambda_1 - \lambda_2)V^{\pi_0} + \lambda_1V^{q_1} + \lambda_2V^{q_2} + \lambda_1\lambda_2(\delta^{q_1} - \delta^{q_2})^2$
         $\qquad + (1 - \lambda_1 - \lambda_2)\lambda_1(\delta^{\pi_0} - \delta^{q_1})^2 + (1 - \lambda_1 - \lambda_2)\lambda_2(\delta^{\pi_0} - \delta^{q_2})^2,$

where $\lambda_i(x) = \varepsilon_i m(x|q_i)/m(x|\pi)$ for $i = 1, 2$. Thus one can find the ML-II prior
by separately maximizing $m(x|q_1)$ and $m(x|q_2)$ in (5.3) (unless $\mathcal{Q}_1$ and $\mathcal{Q}_2$ are
related in some fashion) and then easily calculate the resultant ML-II posterior
mean and variance.

Before proceeding, it is worthwhile to note that $\Gamma$ of the form (5.2) might be of
interest in other than hierarchical prior situations. Indeed, whenever one has
several possible models in mind for the contamination, or even for $\pi_0$ itself, the
uncertainty can be reasonably represented by such a $\Gamma$.


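5.2. Second stage uncertainty. Suppose, in the situation of Section 5.1, that
only $h_0$ is deemed uncertain. (Knowledge at higher levels of hierarchical priors
will often be more vague than at lower levels.) An $\varepsilon$-contamination model for $h$
would be

(5.6)    $h(\omega) = (1-\varepsilon)h_0(\omega) + \varepsilon s(\omega), \quad s \in \mathcal{S}.$

The resulting prior for $\theta$ is

(5.7)    $\pi(\theta) = \int\left[\prod_{i=1}^{p} g_\omega(\theta_i)\right]h(\omega)\,d\omega = (1-\varepsilon)\pi_0(\theta) + \varepsilon q(\theta),$

where

    $\pi_0(\theta) = \int\left[\prod_{i=1}^{p} g_\omega(\theta_i)\right]h_0(\omega)\,d\omega \quad$ and $\quad q(\theta) = \int\left[\prod_{i=1}^{p} g_\omega(\theta_i)\right]s(\omega)\,d\omega.$

Letting $\mathcal{Q} = \{q: s \in \mathcal{S}\}$, it follows that the uncertainty in $\pi$ can be expressed by

    $\Gamma = \{\pi: \pi = (1-\varepsilon)\pi_0 + \varepsilon q,\ q \in \mathcal{Q}\}.$

In determining the ML-II prior for this situation, it will be convenient to
define

    $m(x|\omega) = \int f(x|\theta)\left[\prod_{i=1}^{p} g_\omega(\theta_i)\right]d\theta,$

which is clearly the marginal distribution of $X$ under the assumption that the
prior for $\theta$ is $[\prod_{i=1}^{p} g_\omega(\theta_i)]$. Note that

(5.8)    $m(x|\pi) = (1-\varepsilon)m(x|\pi_0) + \varepsilon\int m(x|\omega)s(\omega)\,d\omega.$

When $\mathcal{S} = \mathcal{P} = \{$all distributions$\}$, it is clear from (5.8) that

    $\sup_{\pi\in\Gamma} m(x|\pi) = (1-\varepsilon)m(x|\pi_0) + \varepsilon\sup_{\omega} m(x|\omega).$

Assuming that $m(x|\omega)$ has a maximum at $\hat\omega$, it follows that the ML-II prior is

    $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta) + \varepsilon\prod_{i=1}^{p} g_{\hat\omega}(\theta_i),$

for which analysis is usually quite straightforward.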





EXAMPLE 3. Suppose that $X = (X_1,\ldots,X_p) \sim N_p(\theta, \sigma^2 I_p)$, $\sigma^2$ known, and
that the first-stage prior information is that the $\theta_i$ are independent with a
common $N(\mu, \tau^2)$ distribution, to be denoted $g_\omega$, with $\omega = (\mu, \tau^2)$ unknown.
Note that $m(x|\omega)$ is $N_p(\mu\mathbf{1}, (\sigma^2 + \tau^2)I_p)$, where $\mathbf{1} = (1,\ldots,1)$. It is easy to check
that $m(x|\omega)$ is maximized at

$\hat\omega = (\hat\mu, \hat\tau^2) = \Big(\bar x,\ \max\Big\{0,\ \frac{1}{p}\sum_{i=1}^{p}(x_i - \bar x)^2 - \sigma^2\Big\}\Big).$
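Hence, with contaminated second-stage prior as in (5.6) and $\mathcal{S} = \mathcal{P}$, the ML-II
prior is $\hat\pi(\theta) = (1-\varepsilon)\pi_0(\theta) + \varepsilon\hat q(\theta)$, where $\hat q$ is $N_p(\hat\mu\mathbf{1}, \hat\tau^2 I_p)$.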
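As a very special case, suppose $h_0$ is a point mass at $(\mu_0, \tau_0^2)$, so that $\pi_0$ is
simply $N_p(\mu_0\mathbf{1}, \tau_0^2 I_p)$. Then the ML-II posterior is

$\hat\pi(\theta|x) = \hat\lambda(x)\pi_0(\theta|x) + (1-\hat\lambda(x))\hat q(\theta|x),$

where $\pi_0(\theta|x)$ is $N_p(\delta^0(x), v_0 I_p)$, $\hat q(\theta|x)$ is $N_p(\delta^{EB}(x), \hat v I_p)$, $v_0 = \sigma^2\tau_0^2/(\sigma^2 + \tau_0^2)$,
$\hat v = \sigma^2\hat\tau^2/(\sigma^2 + \hat\tau^2)$,

$\delta^0(x) = x - \frac{\sigma^2}{\sigma^2 + \tau_0^2}(x - \mu_0\mathbf{1}), \qquad \delta^{EB}(x) = x - \frac{\sigma^2}{\sigma^2 + \hat\tau^2}(x - \hat\mu\mathbf{1}),$

and

$\hat\lambda(x) = \frac{(1-\varepsilon)m(x|\pi_0)}{(1-\varepsilon)m(x|\pi_0) + \varepsilon m(x|\hat q)} = \Big[1 + \frac{\varepsilon}{1-\varepsilon}(\sigma^2 + \tau_0^2)^{p/2}\exp\Big\{\frac{\sum_{i=1}^{p}(x_i - \mu_0)^2}{2(\sigma^2 + \tau_0^2)}\Big\}(2\pi)^{p/2}m(x|\hat q)\Big]^{-1},$

where

$(2\pi)^{p/2}m(x|\hat q) = \begin{cases} \sigma^{-p}\exp\Big\{-\sum_{i=1}^{p}(x_i - \bar x)^2/(2\sigma^2)\Big\} & \text{if } \sum_{i=1}^{p}(x_i - \bar x)^2 < p\sigma^2, \\ \Big[\frac{1}{p}\sum_{i=1}^{p}(x_i - \bar x)^2\Big]^{-p/2}\exp\{-\tfrac{1}{2}p\} & \text{otherwise.} \end{cases}$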


Note that $\delta^0$ is the usual conjugate prior estimate of $\theta$, while $\delta^{EB}$ is the usual
empirical Bayes estimate of $\theta$. The overall posterior mean [see (1.7)] is thus

$\delta^{\hat\pi}(x) = \hat\lambda(x)\delta^0(x) + (1-\hat\lambda(x))\delta^{EB}(x),$

which will be close to $\delta^0$ if the $x_i$ are close to $\mu_0$, and close to $\delta^{EB}$ if the $x_i$ are
similar but far from $\mu_0$.
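
The quantities of Example 3 are simple enough to compute directly. The following is a minimal sketch, assuming the point-mass choice of $h_0$ at $(\mu_0, \tau_0^2)$ and $0 < \varepsilon < 1$; the function name `ml2_example3` and its argument names are illustrative and not taken from the paper. The weight $\hat\lambda(x)$ is evaluated on the log scale, and the common factor $(2\pi)^{-p/2}$ is dropped because it cancels.

```python
import math


def ml2_example3(x, sigma2, mu0, tau0_sq, eps):
    """ML-II summary for Example 3 when h_0 is a point mass at (mu0, tau0_sq).

    Names are illustrative; 0 < eps < 1 and sigma2 > 0 are assumed.
    """
    p = len(x)
    xbar = sum(x) / p
    ss_xbar = sum((xi - xbar) ** 2 for xi in x)  # sum of (x_i - xbar)^2
    ss_mu0 = sum((xi - mu0) ** 2 for xi in x)    # sum of (x_i - mu0)^2

    # Type-II maximum likelihood estimate of omega = (mu, tau^2).
    mu_hat = xbar
    tau_hat_sq = max(0.0, ss_xbar / p - sigma2)

    def log_marginal(ss, total_var):
        # Log of an N_p(c*1, total_var*I) density at x, without the common
        # factor (2*pi)^(-p/2), which cancels in the weight below.
        return -0.5 * p * math.log(total_var) - ss / (2.0 * total_var)

    log_m0 = log_marginal(ss_mu0, sigma2 + tau0_sq)      # m(x | pi_0)
    log_mq = log_marginal(ss_xbar, sigma2 + tau_hat_sq)  # m(x | q_hat)

    # lambda_hat = (1-eps) m0 / [(1-eps) m0 + eps mq], via the log odds.
    log_odds = math.log(eps) - math.log(1.0 - eps) + log_mq - log_m0
    lam = 0.0 if log_odds > 700.0 else 1.0 / (1.0 + math.exp(log_odds))

    shrink0 = sigma2 / (sigma2 + tau0_sq)
    shrink_eb = sigma2 / (sigma2 + tau_hat_sq)
    delta0 = [xi - shrink0 * (xi - mu0) for xi in x]         # conjugate prior estimate
    delta_eb = [xi - shrink_eb * (xi - mu_hat) for xi in x]  # empirical Bayes estimate
    post_mean = [lam * a + (1.0 - lam) * b for a, b in zip(delta0, delta_eb)]

    return {
        "mu_hat": mu_hat,
        "tau_hat_sq": tau_hat_sq,
        "lam": lam,
        "delta0": delta0,
        "delta_eb": delta_eb,
        "post_mean": post_mean,
    }
```

For data that are similar to one another but far from $\mu_0$, the weight on the conjugate component is essentially zero and the returned posterior mean coincides with the empirical Bayes estimate, in line with the remark above.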

Of course, only rarely will it be appropriate to choose $h_0$ to be a point mass.
More natural would be a choice such as $h_0(\mu, \tau^2) = w_0(\mu)v_0(\tau^2)$, where $w_0(\mu)$ is
$N(\mu_0, A)$ and $v_0$ is, say, a gamma distribution. Although the ML-II posterior is
no longer expressible in closed form for such a situation, the posterior mean and
variance can be written in a form involving a single numerical integral over $\tau^2$
[see, e.g., Lindley (1971)].

Several features of the above example are worth noting. First, the strong
relationship of the ML-II theory with standard empirical Bayes analysis is
apparent. Indeed, if one were to choose $\varepsilon = 1$, the standard empirical Bayes
situation would result. As mentioned in the introduction, we much prefer the
analysis with reasonably small $\varepsilon$, the choice $\varepsilon = 1$ resulting (typically) in there
being a large number of unrealistic priors in $\Gamma$. Of course, the choice $\mathcal{Q} = \mathcal{P}$ also
suffers somewhat from this deficiency, as discussed in Section 2.3. An appealing
possibility in the above example is, therefore, to attempt to apply the ideas of
Section 3 (or possibly Section 4) and work with more reasonable $\mathcal{Q}$. For instance,
if independence of $\mu$ and $\tau^2$ can be assumed, so that $h(\mu, \tau^2) = w(\mu)v(\tau^2)$, one
could elicit $w_0$ and $v_0$, consider

$\mathcal{W}$ = {$w = (1-\varepsilon)w_0 + \varepsilon q_w$: $q_w$ is unimodal, symmetric about the
mode (or perhaps median) of $w_0$}

$\mathcal{V}$ = {$v = (1-\varepsilon)v_0 + \varepsilon q_v$: $q_v$ is unimodal, symmetric about the
mode (or median) of $v_0$},

and apply the ideas of Section 3. We do not attempt the analysis here, because
nothing new conceptually is involved and the argument would be moderately
lengthy.


5.3. First and second stage uncertainty. The simplest modification of (5.7)
that introduces uncertainty in the first stage of the prior is simply to add an
arbitrary overall contamination. Thus we consider

$\pi(\theta) = (1 - \varepsilon_1 - \varepsilon_2)\pi_0 + \varepsilon_1 q_1 + \varepsilon_2 q_2,$

where $q_2 \in \mathcal{Q}_2 = \mathcal{P}$,

$q_1 = \int \Big[\prod_{i=1}^{p} g_\omega(\theta_i)\Big] s(\omega)\,d\omega \in \mathcal{Q}_1 = \{q: s \in \mathcal{S}\},$

and $\pi_0$, $s$, and $\mathcal{S}$ are as in (5.7). In other words, $q_1$ arises from possible second
stage prior uncertainty, while $q_2$ allows for basic error in the empirical Bayes
model.

Allowing arbitrary $q_2$ is, again, probably excessively crude. In particular,
complete abandonment of the empirical Bayes structure may be unrealistic. For
illustrative purposes, however, this is convenient.

As mentioned in Section 5.1, the ML-II prior can be found (here, at least) by
separately maximizing $m(x|q_1)$ and $m(x|q_2)$. Maximization of $m(x|q_1)$ was
discussed in the previous section. And $m(x|q_2)$ will simply be maximized when $q_2$
is a unit point mass at $\hat\theta$, the maximum likelihood estimate. Thus the ML-II prior
is [assuming $\mathcal{S} = \mathcal{P}$ and letting $\langle\hat\theta\rangle$ denote a unit point mass at $\hat\theta$]

$\hat\pi(\theta) = (1 - \varepsilon_1 - \varepsilon_2)\pi_0(\theta) + \varepsilon_1\Big[\prod_{i=1}^{p} g_{\hat\omega}(\theta_i)\Big] + \varepsilon_2\langle\hat\theta\rangle.$

Formulas (5.3)-(5.5) can now easily be employed to give desired conclusions. In
the situation of Example 3, for instance, all calculations can be carried out
explicitly; indeed, the needed modifications to the formulae there are very minor
and so will be omitted. The behavior of $\delta^*$, the ML-II posterior mean, is worth
mentioning, however. If the data are compatible with $\pi_0$ (i.e., are near $\mu_0$) then
the conjugate prior posterior mean $\delta^0$ will dominate; if the data are similar but
not near $\mu_0$, then $\delta^*$ will be close to the natural empirical Bayes rule $\delta^{EB}$; and if
the data are not compatible with the empirical Bayes model, then $\delta^*$ will be close
to the maximum likelihood estimate, $\hat\theta = x$.


6. Discussion. We view this paper as a hopeful first step in the development 
of systematic robust Bayesian analyses for rich classes of priors. The original 
goals of the paper were (i) to demonstrate that it is possible to work with complex 
classes of priors (as in Section 4), and to indicate mathematical techniques for 
doing so; (ii) to point out the numerous intuitive and calculational reasons for 
approaching robustness through consideration of e-contamination classes; and 
(iii) to exhibit the value of the ML-II approach in obtaining “robust priors.” 
Through tenacious prodding from skeptical referees and the associate editor, we 
have become more cautious in our assessment of success in goal (iii). A brief
discussion of this issue is in order. 

The key to obtaining a robust prior appears to be the selection of a prior with 
tails that are much flatter than the tails of the likelihood function [see Berger 
(1984, 1985) for discussion and references]. Unfortunately, this observation does 
not provide a readily implementable “solution” to robustness questions. Basic 
difficulties include (i) the uncertainty as to the choice of robust prior tails and (ii) 
the calculational complexities that can result. In addition to the already
established computational simplicity, our hope for the ML-II technique was that it
would automatically provide a “prior” robust against the type of deviations
considered plausible.

It is important to explain our reasons for believing that ML-II would succeed
when applied to $\varepsilon$-contamination classes. Suppose that $\varepsilon$ is fairly small and that
$\Gamma$ contains all plausible priors, but none that are terribly implausible. Consider
first the case where $m(x|\pi_0)$ is large, i.e., the data is compatible with the
nominally specified $\pi_0$. This is a situation of nearly automatic robustness, in that,
since the central portions of all $\pi$ in $\Gamma$ (and, hence, $\hat\pi$) will be similar to that of
$\pi_0$, the conclusions will be very similar for all $\pi$ in $\Gamma$. On the other hand, consider
the case where $x$ is such that $m(x|\pi_0)$ is small. This is precisely the situation in
which the prior tail is highly influential and the use of a large prior tail is
desirable. The ML-II technique will naturally select a prior with a large tail, since
such priors are those most compatible with the data. Opposing the above
encouraging tendencies toward robustness is the danger that data-selected priors
will tend to over-concentrate about the likelihood function, thereby yielding error
estimates that are too small. (Such dangers lurk in the shadows of much of
empirical Bayesian analysis and, for that matter, the use of data-selected models.)
When $\Gamma$ contains unreasonable priors, as in Section 2, this problem can
completely dominate. However, the hope was that reasonable $\Gamma$, such as those in
Sections 3 and 4, by limiting the possible concentration about the likelihood
function, would not succumb to this danger to a serious extent.

To examine the degree to which this hope was realized, we return to the
example discussed in Sections 2-4: $X \sim N(\theta, 1)$, $\pi_0$ is $N(0, 2)$, and $\varepsilon = 0.2$. The
four estimators of $\theta$ presented were (changing notation for convenience):

$\delta_1$, the Bayes estimator with respect to $\pi_0$;

$\delta_2$, the ML-II posterior mean for $\mathcal{Q}$ = {all distributions};

$\delta_3$, the ML-II posterior mean for $\mathcal{Q}$ = {all symmetric, unimodal distributions};

$\delta_4$, the ML-II posterior mean for $\mathcal{Q}$ = {$q$ such that $\pi$ is unimodal}.


Let us add to this list $\delta_5$, the posterior mean for a t-prior distribution with 4
degrees of freedom, median zero, and quartiles equal to $\pm 0.96$ (the quartiles of
$\pi_0$); this corresponds to a scale factor of 1.3 for the t-distribution. The point of
including the t-prior is that it is a reasonable robust prior (which happens to be in
all the classes $\Gamma$ for the three choices of $\mathcal{Q}$ above), and provides a benchmark for judging the



performance of ML-II. Figure 1 presents graphs of $\delta_1$ through $\delta_5$ for $2 < x < 10$
(there is very little difference between the estimators for $x < 2$). Figure 2 presents
the corresponding natural error estimates, namely, the posterior (ML-II versions
for $\delta_2$, $\delta_3$ and $\delta_4$) standard deviations $s_1(x), s_2(x), \ldots, s_5(x)$.
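
The scale factor quoted for the t-prior can be checked numerically. The sketch below is an illustration, not part of the paper: it uses the standard closed-form distribution function of Student's t with 4 degrees of freedom, inverts it by bisection, and compares the upper quartile with that of the $N(0, 2)$ prior. The helper names `t4_cdf` and `t4_quantile` are hypothetical.

```python
import math
from statistics import NormalDist


def t4_cdf(t):
    """Closed-form CDF of Student's t with 4 degrees of freedom."""
    u = t / math.sqrt(1.0 + t * t / 4.0)
    return 0.5 + 0.375 * u * (1.0 - u * u / 12.0)


def t4_quantile(prob, lo=-50.0, hi=50.0, tol=1e-12):
    """Invert t4_cdf by bisection; adequate for central probabilities."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if t4_cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Upper quartile of the N(0, 2) prior (standard deviation sqrt(2)).
normal_quartile = NormalDist(mu=0.0, sigma=math.sqrt(2.0)).inv_cdf(0.75)

# Upper quartile of a standard t with 4 degrees of freedom.
t4_quartile = t4_quantile(0.75)

# Scale that would match the two quartiles exactly.
matching_scale = normal_quartile / t4_quartile

# Quartile of the t-prior when the scale is rounded to 1.3.
scaled_t_quartile = 1.3 * t4_quartile
```

Under these assumptions the matching scale is close to 1.3, and the quartile of the scaled t-prior agrees with the normal quartile to within about 0.01, consistent with the rounded figures in the text.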

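The most obvious conclusion from Figure 1 is that the ML-II estimates are
more conservative (closer to $x$) than the Bayes estimates. Also, quite naturally,
larger $\mathcal{Q}$ result in more conservative estimates (note that the three classes are nested: the symmetric, unimodal class is contained in the class for which $\pi$ is unimodal, which is contained in the class of all distributions).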

Turning to Figure 2, note first that s(x) is indeed fairly ridiculous for large x. 
Also, s,(x) is moderately smaller than s,(x). We had hoped that the attractive $\Gamma$
of Section 4 would not be so large as to result in an excessively small ML-II error
estimate, but when compared to s,, the behavior of s, is borderline. On the other 
hand, the behavior of s, is definitely satisfactory. Finally, note that for moderate 
x, both s, and s, rise above the conditional standard deviation (1) of X. [The 


behavior of s, is in fact quite general for robust priors; for example, see O’Hagan
(1981).]

To summarize, we proposed the ML-II technique as a possible automatic
“robustifier” of standard (conjugate) priors. We have shown that the ML-II
technique is calculationally feasible, and that it can successfully robustify $\pi_0$ if
the class $\Gamma$ is sensible (i.e., does not contain silly contaminations such as those in
the class of all distributions). Alternatively, in any given problem one could (should?) attempt to construct
a robust prior, such as the t-prior in the above example. We are in no way
opposed to such efforts, but there are numerous technical and theoretical issues
involved in insuring that robustness is obtained; the process is far from
automatic. While we are not strong proponents of “automation” in statistics (few
Bayesians are), we recognize the forces driving statistics in that direction. Of
course, nothing is completely automatic; our ML-II approach does require the
imputation of $\varepsilon$ and $\mathcal{Q}$. However, it will often be reasonable to choose $\mathcal{Q}$ in a
standard or default fashion, say as in Section 3. The quantity $\varepsilon$ is reasonably
accessible to intuition, relating in a fairly straightforward manner to one’s
confidence in the specification of $\pi_0$.


Acknowledgments. We would like to thank the referees and Associate 
Editor for very valuable comments and suggestions. 


REFERENCES 


BERGER, J. (1982). Bayesian robustness and the Stein effect. J. Amer. Statist. Assoc. 77 358-368.

BERGER, J. (1984). The robust Bayesian viewpoint (with Discussion). In Robustness in Bayesian
Statistics (J. Kadane, ed.). North-Holland, Amsterdam.

BERGER, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer, New York.

BERGER, J. and BERLINER, M. (1983). Robust Bayes and empirical Bayes analysis with ε-contaminated
priors. Technical Report 83-35, Purdue Univ., West Lafayette, Indiana.

BERGER, J. and BERLINER, M. (1984). Bayesian input in Stein estimation and a new minimax
empirical Bayes estimator. J. Econometrics 26 87-108.

BERGER, J. and SELLKE, T. (1984). Testing a point null hypothesis: the irreconcilability of
significance levels and evidence. Technical Report 84-27, Purdue Univ., West Lafayette, Indiana.

BICKEL, P. J. (1984). Parametric robustness or small biases can be worthwhile. Ann. Statist. 12
864-879.

BISHOP, Y. M. M., FIENBERG, S. E. and HOLLAND, P. W. (1975). Discrete Multivariate Analysis.
M.I.T. Press, Cambridge.

BLUM, J. R. and ROSENBLATT, J. (1967). On partial a priori information in statistical inference. Ann.
Math. Statist. 38 1671-1678.

BOX, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness (with
Discussion). J. Roy. Statist. Soc. Ser. A 143 383-430.

BOX, G. E. P. and TIAO, G. C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley,
Reading, Mass.

DEMPSTER, A. P. (1975). A subjectivist look at robustness. Bull. Inst. Internat. Statist. 46 349-374.

DEROBERTIS, L. and HARTIGAN, J. A. (1981). Bayesian inference using intervals of measures. Ann.
Statist. 9 235-244.

GOOD, I. J. (1950). Probability and the Weighing of Evidence. Griffin, London.

GOOD, I. J. (1965). The Estimation of Probabilities. M.I.T. Press, Cambridge.

GOOD, I. J. (1980). Some history of the hierarchical Bayesian methodology. In Bayesian Statistics (J.
M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, eds.). University Press,
Valencia.




HUBER, P. J. (1973). The use of Choquet capacities in statistics. Bull. Inst. Internat. Statist. 45
181-191.

JEFFREYS, H. (1961). Theory of Probability, 3rd ed. University Press, Oxford.

KADANE, J. B., DICKEY, J. M., WINKLER, R. L., SMITH, W. S. and PETERS, S. C. (1980). Interactive
elicitation of opinion for a normal linear model. J. Amer. Statist. Assoc. 75 845-854.

LEAMER, E. E. (1978). Specification Searches. Wiley, New York.

LEONARD, T. (1974). A modification to the Bayes estimate for the mean of a normal distribution.
Biometrika 61 627-628.

LINDLEY, D. V. (1971). The estimation of many parameters. In Foundations of Statistical Inference
(V. P. Godambe and D. A. Sprott, eds.). Holt, Rinehart, and Winston, Toronto.

LINDLEY, D. V. and SMITH, A. F. M. (1972). Bayes estimates for the linear model. J. Roy. Statist.
Soc. Ser. B 34 1-41.

MARAZZI, A. (1985). Robust Bayesian estimation for the linear model. Statistics and Decisions 3.

MARITZ, J. S. (1970). Empirical Bayes Methods. Methuen, London.

MORRIS, C. (1983). Parametric empirical Bayes inference: theory and applications (with Discussion).
J. Amer. Statist. Assoc. 78 47-65.

O’HAGAN, A. (1981). A moment of indecision. Biometrika 68 329-330.

SCHNEEWEISS, H. (1964). Eine Entscheidungsregel für den Fall partiell bekannter
Wahrscheinlichkeiten. Unternehmensforschung 8 86-95.

ZELLNER, A. (1985). On assessing prior distributions and Bayesian regression analysis with g-prior
distributions. In Bayesian Inference and Decision Techniques with Applications (P. K.
Goel and A. Zellner, eds.). North-Holland, Amsterdam.


DEPARTMENT OF STATISTICS
PURDUE UNIVERSITY
WEST LAFAYETTE, INDIANA 47907

DEPARTMENT OF STATISTICS
OHIO STATE UNIVERSITY
1958 NEIL AVENUE
COLUMBUS, OHIO 43210


The Annals of Statistics
1986, Vol. 14, No. 2, 487-501


CHARACTERIZATION OF EXTERNALLY BAYESIAN POOLING 
OPERATORS 


By CHRISTIAN GENEST, KEVIN J. MCCONWAY,
AND MARK J. SCHERVISH


University of Waterloo, The Open University, 
and Carnegie-Mellon University 


When a panel of experts is asked to provide some advice in the form of a
group probability distribution, the question arises as to whether they should
synthesize their opinions before or after they learn the outcome of an
experiment. If the group posterior distribution is the same whatever the
order in which the pooling and the updating are done, the pooling mechanism
is said to be externally Bayesian by Madansky (1964). In this paper, we
characterize all externally Bayesian pooling formulas and we give conditions
under which the opinion of the group will be proportional to the geometric
average of the individual densities.


1. Introduction. Let $(\Theta, \mu)$ be a measure space and let $\Delta$ be the class of
$\mu$-measurable functions $f: \Theta \to [0, \infty)$ such that $f > 0$ $\mu$-a.e. and $\int f\,d\mu = 1$. In
the language of multiagent statistical decision theory (cf. Weerahandi and Zidek,
1981), a pooling operator is any function $T: \Delta^n \to \Delta$ which may be used to
extract a “consensus” $T(f_1,\ldots,f_n)$ from the different subjective opinions
$f_1,\ldots,f_n \in \Delta$ of the $n$ members of a group. The current interest in pooling
operators seems to stem from a theorem due to Wald (1939) concerning the
optimality of Bayesian decision rules. When formulated in the context of a group
decision problem, this theorem suggests that at least in the case where all the
members of the group have the same utility function, it is generally preferable
for them to agree on an “average opinion” $T(f_1,\ldots,f_n)$ and to adopt that action
which maximizes their common utility with respect to $T(f_1,\ldots,f_n)$, rather than
to take an “average decision” based on the optimal decisions of each of the
individuals (see de Finetti, 1954).

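A few years ago, Madansky (1964, 1978) suggested the use of pooling formulas,
$T$, which have the following property:

(1.1)    $T\Big(\frac{l f_1}{\int l f_1\,d\mu},\ldots,\frac{l f_n}{\int l f_n\,d\mu}\Big) = \frac{l\,T(f_1,\ldots,f_n)}{\int l\,T(f_1,\ldots,f_n)\,d\mu}, \quad \mu\text{-a.e.},$

whenever $l: \Theta \to (0, \infty)$ is such that

(1.2)    $0 < \int l f_i\,d\mu < \infty, \quad i = 1,\ldots,n.$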


Received October 1984; revised June 1985. 
AMS 1980 subject classifications. Primary 62A99; secondary 39B40.
Key words and phrases. Consensus, expert opinions, external Bayesianity, logarithmic pool. 




A function $l$ which satisfies condition (1.2) above is hereafter called a likelihood
for $(f_1,\ldots,f_n)$. Implicit in the statement of (1.1) is that the integral
$\int l\,T(f_1,\ldots,f_n)\,d\mu$ is finite whenever $l$ is a likelihood for $(f_1,\ldots,f_n)$.

A different but obviously equivalent formulation of Madansky’s condition
would be to specify that

(1.3)    $f_1/g_1 \propto \cdots \propto f_n/g_n \propto l \ \Rightarrow\ T(f_1,\ldots,f_n)/T(g_1,\ldots,g_n) \propto l.$


Pooling operators which obey (1.1) or (1.3) are called externally Bayesian. By
adopting an externally Bayesian formula, a group assures itself that when
additional information $l$ is perceived jointly, their collective opinion can be
modified (using Bayes rule), producing the same result as if the pooling operator
had been applied after each individual distribution has been revised. Raiffa
(1968, pages 221-226) shows with reference to a concrete example how using a
pooling formula which fails (1.1) may lead the members of the group to act
strangely. In Raiffa’s example, as it were, all the individuals try to increase the
influence of their opinion on the consensus by insisting that it be computed
before the outcome of an experiment is known. This happens because the
members of the group know that whether the $f_i$’s are updated or not, the
consensus will be computed using the same weighted arithmetic average of their
opinions, viz.

(1.4)    $T(f_1,\ldots,f_n) = \sum_{i=1}^{n} w_i f_i, \quad \mu\text{-a.e.},$

a formula which violates (1.1) unless $w_i = 1$ for some $i$ and $w_j = 0$ for all $j \ne i$.

To ensure that the order in which the pooling and updating are done is
immaterial, Bacharach (1972) suggests that the consensus should be computed
using a logarithmic pooling operator, viz.

(1.5)    $T(f_1,\ldots,f_n) = \prod_{i=1}^{n} f_i^{w_i} \Big/ \int \prod_{i=1}^{n} f_i^{w_i}\,d\mu, \quad \mu\text{-a.e.},$

where $w_1,\ldots,w_n$ are nonnegative weights such that $\sum_{i=1}^{n} w_i = 1$ as in (1.4).
According to Bacharach, it is Peter Hammond who first observed that the
operator (1.5) is externally Bayesian. In a recent article, Genest (1984b) has
shown that this is also the only solution of the functional equation (1.1) when
there exists a function $G: (0, \infty)^n \to (0, \infty)$ which is Lebesgue measurable and
such that

(1.6)    $T(f_1,\ldots,f_n)(\theta) \propto G(f_1(\theta),\ldots,f_n(\theta)), \quad \mu\text{-a.e.},$

where the proportionality constant must be independent of $\theta$. (A precise
statement of this result is to be found in Section 2.) The formula (1.6) means that,
except for a normalizing constant, the value of $T$ at a particular $\theta$ depends on
the $f_i$’s only through their values at $\theta$. The import of this theorem is still
limited, however, especially because the proof given in Genest (1984b) does not
apply when the space $(\Theta, \mu)$ is purely atomic, an assumption which excludes the
important case where $\Theta$ is finite or countable.
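
The contrast between the linear pool (1.4) and the logarithmic pool (1.5) is easy to see numerically. The sketch below is an illustration only: it assumes a three-point $\Theta$ with counting measure, two experts, and made-up opinions, weights, and likelihood values; all function names are hypothetical.

```python
import math


def normalize(v):
    """Rescale a positive vector so that it sums to one."""
    total = sum(v)
    return [vi / total for vi in v]


def bayes_update(f, lik):
    """Posterior proportional to likelihood times prior, as on each side of (1.1)."""
    return normalize([fi * li for fi, li in zip(f, lik)])


def linear_pool(densities, weights):
    """Weighted arithmetic average, as in (1.4)."""
    k = len(densities[0])
    return [sum(w * f[j] for f, w in zip(densities, weights)) for j in range(k)]


def log_pool(densities, weights):
    """Normalized weighted geometric average, as in (1.5)."""
    k = len(densities[0])
    return normalize(
        [math.prod(f[j] ** w for f, w in zip(densities, weights)) for j in range(k)]
    )


def external_bayes_gap(pool, densities, weights, lik):
    """Largest discrepancy between 'pool then update' and 'update then pool'."""
    pooled_then_updated = bayes_update(pool(densities, weights), lik)
    updated_then_pooled = pool([bayes_update(f, lik) for f in densities], weights)
    return max(abs(a - b) for a, b in zip(pooled_then_updated, updated_then_pooled))


# Illustrative inputs (assumptions, not taken from the paper).
opinions = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
weights = [0.4, 0.6]
likelihood = [0.9, 0.5, 0.1]

gap_log = external_bayes_gap(log_pool, opinions, weights, likelihood)
gap_linear = external_bayes_gap(linear_pool, opinions, weights, likelihood)
```

With these inputs the logarithmic pool gives the same answer in either order up to rounding error, while the linear pool does not.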




In this paper, we will provide all the solutions of the functional equation (1.1),
that is, without restriction to those operators which obey (1.6) and without
imposing regularity conditions on the space $(\Theta, \mu)$. The form of the solutions is
given in (2.1) below and is worked out explicitly in Section 3 for the case where
individual opinions are represented as probabilities of a single event. When the
function $G$ in (1.6) is indexed by $\theta$ and $(\Theta, \mu)$ can be partitioned into at least
four nonnegligible sets, we show that all the solutions must be of the form

(1.7)    $T(f_1,\ldots,f_n) = g\prod_{i=1}^{n} f_i^{w_i} \Big/ \int g\prod_{i=1}^{n} f_i^{w_i}\,d\mu, \quad \mu\text{-a.e.},$

$g$ being an arbitrary bounded function on $\Theta$ and $w_1,\ldots,w_n$ being arbitrary
weights adding up to 1. Pooling operators of the form (1.7) have already been
characterized by McConway (1978), but only in the case where the measure $\mu$ is
purely atomic (thereby forcing $\Theta$ to be countable). Obviously, the weights in
(1.7) can be negative, but only when $\Theta$ is finite.

It may not always be reasonable to require a pooling formula to satisfy the 
criterion of Madansky (1964, 1978), viz. (1.1). Such would be the case if, for 
instance, the group itself were not required to make a decision, but rather were 
only asked to provide a group opinion to an external decision maker. In 
situations where a decision maker is present, French (1981, 1985) and Lindley 
(1985) have argued rightly that it would seem more reasonable for this decision 
maker to adopt the Bayesian approach and to treat the opinions of the members 
of the group as data which he/she would use to update his/her own subjective 
probability distribution. In certain circumstances, French and Lindley have 
shown how the formula $T(f_1,\ldots,f_n)$ representing the decision maker’s opinion
after hearing the experts could well violate (1.1). In this paper, however, we 
address what French (1985) would call the “group decision problem,” that in 
which there is no natural decision maker and the group is either unwilling or 
unable to provide one. 

Whether it involves a decision maker or not, the problem of determining a 
sensible formula for representing the opinions of a group has received a lot of 
attention in recent years. Let us mention, for example, the papers of French 
(1980, 1981), Morris (1974, 1977), Winkler (1968, 1981), and Genest and Schervish 
(1985), all of which adopt the Bayesian viewpoint. References on the so-called 
“group problem” include Laddaga (1977), McConway (1981), Wagner (1982, 
1984), and Genest (1984a, c). An extensive bibliography has recently been 
compiled on both versions of the problem by Genest and Zidek (1986). 


2. Characterizing externally Bayesian pooling operators. First observe
that the class $\Delta$ is nonempty if and only if the measure $\mu$ is $\sigma$-finite, and that the
operator (1.7) is well-defined since

$0 < \int g\prod_{i=1}^{n} f_i^{w_i}\,d\mu \le \|g\|_\infty \prod_{i=1}^{n}\Big[\int f_i\,d\mu\Big]^{w_i} = \|g\|_\infty < \infty$

by Hölder’s inequality. For convenience, we will assume that every singleton
subset of $\Theta$ is measurable. The following theorem is a slight modification of a
theorem of Genest (1984b).


THEOREM 2.1. Let $(\Theta, \mu)$ be a measure space, and suppose that $\mu$ is not
purely atomic. Let also $T: \Delta^n \to \Delta$ be a pooling operator for which (1.6) holds.
Then $T$ is externally Bayesian if and only if it is logarithmic, i.e., if and only if
there exist nonnegative weights $w_1,\ldots,w_n$ such that $\sum_{i=1}^{n} w_i = 1$ for which (1.5)
holds.


PROOF. The proof is the same as in Genest (1984b), since the $\sigma$-field on
which $\mu$ is defined always contains nonnegligible sets with arbitrarily small
measure, except in the case where the measure $\mu$ is purely atomic. (See Halmos
(1950), Exercise 1, page 174.) □


In the following, our main objective is to generalize Theorem 2.1 by
characterizing all the pooling operators which have property (1.1) without imposing
condition (1.6), and without restricting the underlying measure space $(\Theta, \mu)$. To
do this, we first consider the case in which the “group” consists of a single
expert, and we show that every externally Bayesian pooling operator is then of
the form (1.7).


THEOREM 2.2. Let $T: \Delta \to \Delta$ be a pooling operator. Then $T$ is externally
Bayesian if and only if there exists a bounded function $g: \Theta \to [0, \infty)$ such that
$g > 0$ $\mu$-a.e. and

$T(f) = gf \Big/ \int gf\,d\mu, \quad \mu\text{-a.e.},$

for all $\mu$-densities $f$ in $\Delta$.


PROOF. Let $f$ and $h$ be arbitrary in $\Delta$. Set $g = T(h)/h$ and consider the
likelihood function $l = f/h$. Since $T$ is externally Bayesian, we have

$T(f) = T\Big(lh \Big/ \int lh\,d\mu\Big) = l\,T(h) \Big/ \int l\,T(h)\,d\mu = gf \Big/ \int gf\,d\mu, \quad \mu\text{-a.e.},$

for all $f \in \Delta$. We also have $\int gf\,d\mu < \infty$ for all $f$, which implies that $g$ is
essentially bounded. (See, for example, Theorem 20.15 in Hewitt and Stromberg
(1965).) The definition of $T$ does not depend on the choice of $h$, since $g$ is unique
up to a constant multiple. □


In the case where $n > 1$, the basic idea consists of reducing the problem to the
context of Theorem 2.2 by dividing the domain of $T$ into equivalence classes in
such a way that, given the value of $T$ at one member of an equivalence class, the
externally Bayesian property defines the value of $T$ at all other members of that
class.


DEFINITION 2.3. Two vectors $(f_1,\ldots,f_n)$ and $(f_1^*,\ldots,f_n^*)$ in $\Delta^n$ are said to
be equivalent and we write $(f_1,\ldots,f_n) \sim (f_1^*,\ldots,f_n^*)$ if and only if

$\forall\, i, j\ \ \exists\, c_{ij} > 0$ such that $f_i/f_j = c_{ij}\,(f_i^*/f_j^*)$, $\mu$-a.e.


It is obvious that $\sim$ is an equivalence relation on $\Delta^n$ and that two vectors of
densities belong to the same equivalence class if and only if there exists a
likelihood function $l: \Theta \to (0, \infty)$ such that

$f_i^* = l f_i \Big/ \int l f_i\,d\mu, \quad \mu\text{-a.e.},$

for all $i = 1,\ldots,n$. In the sequel, we use the Greek letter $\alpha$ to refer to an
arbitrary element of the quotient-space $\mathcal{E} = \Delta^n/\sim$. For each $\alpha \in \mathcal{E}$, we denote
$(f_1^\alpha,\ldots,f_n^\alpha)$ an arbitrary but fixed vector of $\mu$-densities in $\alpha$. That such a
representative can be chosen for each equivalence class $\alpha$ follows from the Axiom
of Choice.


EXAMPLE 2.4. Let $(f_1^\alpha,\ldots,f_n^\alpha)$ be a vector of exponential densities such that
$f_i^\alpha(\theta) = \lambda_i\exp\{-\lambda_i\theta\}$ for $\theta > 0$, $\lambda_i$ being a nonnegative parameter, $i = 1,\ldots,n$.
It is easy to see that a vector $(f_1,\ldots,f_n)$ belongs to the same equivalence class as
$(f_1^\alpha,\ldots,f_n^\alpha)$ if and only if $\int f_1\exp\{(\lambda_1 - \lambda_k)\theta\}\,d\theta < \infty$ for all $k = 1,\ldots,n$, and
$f_i \propto f_1\exp\{(\lambda_1 - \lambda_i)\theta\}$, $\mu$-a.e. for $i > 1$. The condition on $f_1$ means that its
moment generating function is finite at each point $\lambda_1 - \lambda_k$, $k = 2,\ldots,n$.


We are now in a position to state the main result of this section. 


THEOREM 2.5. Let $T: \Delta^n \to \Delta$ be an arbitrary pooling operator. Then $T$ is
externally Bayesian if and only if

(2.1)    $T(f_1,\ldots,f_n) \propto b_\alpha v_\alpha f_1/f_1^\alpha, \quad \mu\text{-a.e.},$

where (using the above notation) $\alpha$ represents the equivalence class of
$(f_1,\ldots,f_n)$ and for each $\alpha$, $b_\alpha$ is some essentially bounded function and $v_\alpha$ is
some function such that $v_\alpha \ge \max\{f_1^\alpha,\ldots,f_n^\alpha\}$ $\mu$-a.e.


PROOF. It is easy to see that for any fixed $b_\alpha$ and $v_\alpha$, pooling operators of the
form (2.1) are well-defined and externally Bayesian. The crux of the proof
consists in showing that these are the only ones.

For each $\alpha \in \mathcal{E}$, denote $h_\alpha = T(f_1^\alpha,\ldots,f_n^\alpha)$. For an arbitrary $(f_1,\ldots,f_n)$
which is equivalent to $(f_1^\alpha,\ldots,f_n^\alpha)$, consider the likelihood function $l = f_1/f_1^\alpha$.
Since $f_i^\alpha/f_1^\alpha = c_{i1}(f_i/f_1)$, $\mu$-a.e., one has $\int l f_i^\alpha\,d\mu = c_{i1} < \infty$ and $l f_i^\alpha/\int l f_i^\alpha\,d\mu = f_i$
for all $i = 1,\ldots,n$. Since $T$ is externally Bayesian, it follows that

$T(f_1,\ldots,f_n) \propto l\,T(f_1^\alpha,\ldots,f_n^\alpha) = h_\alpha f_1/f_1^\alpha, \quad \mu\text{-a.e.},$

and this remains true as long as $(f_1,\ldots,f_n)$ belongs to the equivalence class $\alpha$.

To complete the proof, assume that $v_\alpha \ge \max\{f_1^\alpha,\ldots,f_n^\alpha\}$ $\mu$-a.e. To show
that $h_\alpha/v_\alpha = b_\alpha$ is essentially bounded, pick an arbitrary $g$ in $\Delta$ and define
$f_1 \propto g f_1^\alpha/v_\alpha$, $\mu$-a.e. Since $\int f_1 f_i^\alpha/f_1^\alpha\,d\mu < \infty$, for all $i = 1,\ldots,n$, we can define
$f_2,\ldots,f_n \in \Delta$ such that $(f_1,\ldots,f_n) \sim (f_1^\alpha,\ldots,f_n^\alpha)$ by putting $f_i \propto f_1 f_i^\alpha/f_1^\alpha$ for
$i \ge 2$. Then $T(f_1,\ldots,f_n) \propto g h_\alpha/v_\alpha$ and hence

$\int g h_\alpha/v_\alpha\,d\mu < \infty$

for all $g \in \Delta$. The conclusion now follows from Theorem 20.15 in Hewitt and
Stromberg (1965). □


At first glance, it may appear as though formula (2.1) is based solely on the
opinion $f_1$ of the first expert, which is intriguing. In reality, however, the
opinions of all the individuals influence the consensus since it is necessary to
know them all in order to determine the equivalence class corresponding to the
vector $(f_1,\ldots,f_n)$. Note that each equivalence class is characterized by an
arbitrary vector $(f_1^\alpha,\ldots,f_n^\alpha)$ in the class or, equivalently, by a component $f_1^\alpha$
and the ratios $f_i^\alpha/f_1^\alpha$, which ratios are invariant (up to constant multiples)
within a given class. For this reason, the operator (2.1) could also be written in
the form

$T(f_1,\ldots,f_n) \propto b_\alpha v_\alpha f_i/f_i^\alpha, \quad \mu\text{-a.e.},$

or alternatively as

(2.2)    $T(f_1,\ldots,f_n) \propto b_\alpha v_\alpha \prod_{i=1}^{n}(f_i/f_i^\alpha)^{w_i}, \quad \mu\text{-a.e.},$

provided that the weights $w_i$ add up to 1. In principle, a different set of weights
could even be chosen for each vector $(f_1,\ldots,f_n)$, since the ratios $f_i/f_i^\alpha$ are equal
to one another up to constant multiples. Another trivial observation is that
every measurable function is bounded when $\Theta$ is finite. In this case, therefore,
the requirement that $b_\alpha$ be bounded is vacuous.

Formula (2.1) is more general than the logarithmic opinion pool (1.7), but this
operator is included as a particular case. To verify this, it suffices to choose the
function $b_\alpha$ in (2.2) equal to $g\prod_{i=1}^{n}(f_i^\alpha)^{w_i}/v_\alpha$, where $g$ is essentially bounded.
This choice is legitimate since $\prod_{i=1}^{n}(f_i^\alpha)^{w_i}/v_\alpha \le \prod_{i=1}^{n}(f_i^\alpha)^{w_i}/\max\{f_1^\alpha,\ldots,f_n^\alpha\} \le 1$,
$\mu$-a.e. More generally, the operator

$T(f_1,\ldots,f_n) = g_\alpha\prod_{i=1}^{n} f_i^{w_i(\alpha)} \Big/ \int g_\alpha\prod_{i=1}^{n} f_i^{w_i(\alpha)}\,d\mu, \quad \mu\text{-a.e.},$

is well-defined and is externally Bayesian when the functions $g_\alpha$ are essentially
bounded and $\sum_{i=1}^{n} w_i(\alpha) = 1$. That is, the function $g$ and the weights $w_i$ in (1.7)
may vary with the equivalence class to which the vector $(f_1,\ldots,f_n)$ belongs
without conflicting with (1.1) or (1.3).

In order to synthesize the group’s opinions using a formula of the form (2.1), it 
will generally be necessary to first determine the equivalence class to which the 
observed vector of opinions belongs. In the following section, we will show what 
this involves in the specific case where each individual in the group is asked to 
provide his/her subjective probability for the realization of an event of interest. 


3. The event case. In this section, we consider ın detail the special case in 
which the space © consists of only two points, say © = {0,1}. This is the case in 
which each expert opinion can be thought of as the probability assigned to the 


EXTERNALLY BAYESIAN OPERATORS 493 


event E = {@ = 1}. In this case, the equivalence class to which a particular 
vector of probability assessments belongs is easy to construct. First note that 
each probability assessment consists of two positive numbers adding to unity, 
say p and 1 — p. (The requirement that p be strictly between 0 and 1 is 
equivalent to the general assumption that the experts’ densities are strictly 
positive almost everywhere.) A vector p = ([p,,1 — p,],..-,[D,.1~p,]) of 
such assessments is equivalent to another such vector q = ([q,,1 ~ q,],..., 
[qan 1 — q,]) in the sense of Definition 2.3 if and only if, for each i = 2,..., n, we 
have a constant k, such that 


(3.1) P/P, = kg./q,0 — p,)/0 p) = 2,0 - ¢,)/0 - 9). 


If we let $\mathcal{P}_i$ and $\mathcal{Q}_i$ be the odds ratios $p_i/(1 - p_i)$ and $q_i/(1 - q_i)$, respectively,
(3.1) simplifies to $\mathcal{P}_i/\mathcal{P}_1 = \mathcal{Q}_i/\mathcal{Q}_1$ for $i = 2, \ldots, n$. Hence two vectors of probability
assessments for an event are equivalent if and only if the pairwise ratios of
assessed odds within one set are equal to the corresponding ratios in the other
set. For example, with $n = 2$, the two vectors ([0.5, 0.5], [0.7, 0.3]) and ([0.75,
0.25], [0.875, 0.125]) are equivalent since the odds ratio of the second expert is
7/3 times as big as the odds ratio of the first expert in each case. The equivalence
class $\alpha(p)$ to which a vector $p$ of probability assessments belongs can be
characterized by the $n - 1$ coordinates of the vector $(\mathcal{P}_2/\mathcal{P}_1, \ldots, \mathcal{P}_n/\mathcal{P}_1)$. That
is, the equivalence classes are in one-to-one correspondence with the $(n - 1)$-dimensional
vectors of positive numbers. The equivalence class corresponding to a
vector $(\alpha_2, \ldots, \alpha_n)$ contains all vectors of probability assessments such that
$\mathcal{P}_i = \alpha_i \mathcal{P}_1$. A canonical representative $p^{\alpha}$ can be chosen from each class $\alpha$ with
$\mathcal{P}_1 = 1$, and we can identify $\alpha$ with the vector $(\alpha_2, \ldots, \alpha_n)$.
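The odds-ratio characterization of the equivalence classes can be checked numerically. The following is a minimal Python sketch, not part of the original development: the function names are hypothetical, and each assessment is represented only by its probability for the event E (the complementary probability is implied).

```python
from math import isclose


def odds(p):
    """Odds p / (1 - p) of a probability strictly between 0 and 1."""
    return p / (1.0 - p)


def class_coordinates(ps):
    """Ratios odds_i / odds_1 for i = 2, ..., n: the label of the class of ps."""
    first = odds(ps[0])
    return [odds(p) / first for p in ps[1:]]


def equivalent(ps, qs, tol=1e-9):
    """Two assessment vectors are equivalent iff their class labels agree."""
    a, b = class_coordinates(ps), class_coordinates(qs)
    return len(a) == len(b) and all(
        isclose(x, y, rel_tol=tol) for x, y in zip(a, b)
    )


def canonical_representative(alpha):
    """Member of the class with odds_1 = 1, i.e. p_1 = 1/2 and odds_i = alpha_i."""
    return [0.5] + [a / (1.0 + a) for a in alpha]


# The two vectors from the example in the text (probabilities of E only).
p = [0.5, 0.7]
q = [0.75, 0.875]
print(equivalent(p, q))
print(canonical_representative(class_coordinates(q)))
```

Working with the ratios relative to expert 1 mirrors the choice of canonical representative in the text; any other reference expert would label the classes equally well.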

Next, we will look at what the externally Bayesian formulas are. Without loss
of generality, we can assume that $v_\alpha$ in Theorem 2.5 is identically 1 for all $\alpha$,
since all densities are probabilities in this problem. It follows that each externally
Bayesian formula can be represented as


(3.2)  $T(p) = \left[ \dfrac{b_\alpha(1)\,p_1}{b_\alpha(1)\,p_1 + b_\alpha(0)(1 - p_1)},\ \dfrac{b_\alpha(0)(1 - p_1)}{b_\alpha(1)\,p_1 + b_\alpha(0)(1 - p_1)} \right],$


where $b_\alpha$ is an arbitrary (bounded) function on $\Theta$ for each $\alpha$, and $\alpha =
(\mathcal{P}_2/\mathcal{P}_1, \ldots, \mathcal{P}_n/\mathcal{P}_1)$. Another way to express (3.2) is to say that the odds ratio
for $T(p)$ is $\mathcal{P}_1 b_\alpha(1)/b_\alpha(0)$. Now it is trivial to see that we can assume $b_\alpha(0) = 1$,
without loss of generality, by simply altering $b_\alpha(1)$. So the odds ratio for $T(p)$ is
just $\mathcal{P}_1 b_\alpha(1)$. For example, if $b_\alpha(1) = 1$, then $T$ is the dictatorship that simply
follows expert 1. Or if $b_\alpha(1) = \alpha_i$, then $T$ is the dictatorship that simply follows
expert $i$. If $b_\alpha(1) = r \prod_{i=2}^{n} \alpha_i^{w_i}$ for arbitrary numbers $w_i$, $i = 2, \ldots, n$, and $r > 0$,
then


(3.3)  $T(p) = \left[ \dfrac{q \prod_{i=1}^{n} p_i^{w_i}}{q \prod_{i=1}^{n} p_i^{w_i} + (1 - q) \prod_{i=1}^{n} (1 - p_i)^{w_i}},\ \dfrac{(1 - q) \prod_{i=1}^{n} (1 - p_i)^{w_i}}{q \prod_{i=1}^{n} p_i^{w_i} + (1 - q) \prod_{i=1}^{n} (1 - p_i)^{w_i}} \right],$


where $w_1 = 1 - \sum_{i=2}^{n} w_i$ and $q = r/(1 + r)$. Formula (3.3) is an analogue of (1.7)
in the event case.

The externally Bayesian rule (3.2) is very general. It
is possible, for example, for $b_\alpha(1)$ to vary in an arbitrary fashion as a function of
$\alpha$. However, it would not usually be desirable for a small change in $p$ to produce
a large change in $T(p)$. It is easy to see that (3.2) is continuous as a function of $p$
within each equivalence class $\alpha$. In order for $T$ to be continuous overall as a
function of $p$, all that is required is that $b_\alpha(1)$ be a continuous function of $\alpha$ (with
the convention that $b_\alpha(0) = 1$). For example, we could let the weights $w_i$ in (3.3)
be continuous functions of $\alpha$ and obtain a generalized logarithmic pool with
weights which depend on the degrees to which the experts differ.
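As an illustration of the logarithmic rule (3.3) in the event case, the sketch below pools the probabilities of E and checks the externally Bayesian property numerically: updating every expert with a common likelihood and then pooling gives the same answer as pooling first and then updating. The function names and the numerical values are assumptions made for the example only; the weights sum to one and one of them is deliberately negative.

```python
from math import isclose, prod


def log_pool_event(ps, ws, r=1.0):
    """Pooled probability of E under (3.3); ws = (w_1, ..., w_n) sum to one, r > 0."""
    q = r / (1.0 + r)
    num = q * prod(p ** w for p, w in zip(ps, ws))
    den = num + (1.0 - q) * prod((1.0 - p) ** w for p, w in zip(ps, ws))
    return num / den


def bayes_update(p, lik1, lik0):
    """Posterior probability of E from prior p and likelihood values l(1), l(0)."""
    return lik1 * p / (lik1 * p + lik0 * (1.0 - p))


ps = [0.5, 0.7, 0.2]        # experts' probabilities for E
ws = [0.5, 0.8, -0.3]       # weights summing to one (not all nonnegative)
r = 2.0

pool_then_update = bayes_update(log_pool_event(ps, ws, r), 2.0, 0.5)
update_then_pool = log_pool_event([bayes_update(p, 2.0, 0.5) for p in ps], ws, r)
print(pool_then_update, update_then_pool)
```

With weights (1, 0, ..., 0) and r = 1 the same function returns the probability of expert 1, which is the dictatorship mentioned above.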


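4. The logarithmic opinion pool. Genest (1984b) conjectured that if an
externally Bayesian pooling operator $T: \Delta^n \to \Delta$ were such that


(4.1)  $T(f_1, \ldots, f_n)(\theta) = G(\theta, f_1(\theta), \ldots, f_n(\theta)) \Big/ \int G(\cdot, f_1, \ldots, f_n)\, d\mu, \quad \mu\text{-a.e.},$


for some $\mu \times$ Lebesgue measurable function $G: \Theta \times (0, \infty)^n \to (0, \infty)$, then $T$
must be a “modified logarithmic opinion pool,” viz.


(4.2)  $T(f_1, \ldots, f_n) = g \prod_{i=1}^{n} f_i^{w_i} \Big/ \int g \prod_{i=1}^{n} f_i^{w_i}\, d\mu, \quad \mu\text{-a.e.},$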


where $g$ is an arbitrary bounded function on $\Theta$ and $w_1, \ldots, w_n$ are (not necessarily
nonnegative) weights summing up to one. This result was actually proven
by McConway (1978) in the case where $\Theta$ is countable and $\mu$ is a counting-type
measure. In this section, we will extend this result by removing the restriction
that the measure space should be purely atomic. Indeed, the only assumption
which we will make here is that $(\Theta, \mu)$ can be partitioned into at least four
nonnegligible sets. We call a measure space which has this property quaternary,
by analogy with the term tertiary introduced by Wagner (1982).

Our proof is a hybrid of that of Theorem 2.1 in Genest (1984b) and an
argument which was developed by McConway (1978) for the countable case. In
accordance with the convention adopted at the beginning of Section 2, each
$\theta \in \Theta$ will either be an atom or will have measure zero. We begin by addressing
the case in which the measure space contains atoms.


LEMMA 4.1. Let $(\Theta, \mu)$ be quaternary and let $T: \Delta^n \to \Delta$ be an externally
Bayesian pooling operator. Suppose that $(\Theta, \mu)$ contains at least two atoms and
that there exists a $\mu \times$ Lebesgue measurable function $G: \Theta \times (0, \infty)^n \to (0, \infty)$
such that (4.1) holds for all $f_1, \ldots, f_n \in \Delta$. Then for every pair of atoms $(\theta, \eta)$ in
$\Theta^2$, the identity


(4.3)  $\dfrac{T(f_1, \ldots, f_n)(\theta)}{T(f_1, \ldots, f_n)(\eta)} = \dfrac{T(h_1, \ldots, h_n)(\theta)}{T(h_1, \ldots, h_n)(\eta)}$


holds for all densities $f_1, \ldots, f_n$ and $h_1, \ldots, h_n \in \Delta$ for which

(4.4)  $f_i(\theta)/f_i(\eta) = h_i(\theta)/h_i(\eta), \quad i = 1, \ldots, n.$


Before we present the proof of this lemma, let us mention parenthetically that
the implication (4.4) $\Rightarrow$ (4.3) is weaker but similar in spirit to the so-called axiom
of relative propensity consistency (RPC) of Genest, Weerahandi, and Zidek
(1984). In view of Lemma 4.1, it is not surprising, therefore, that their RPC
condition should imply a logarithmic opinion pool, at least when the space $(\Theta, \mu)$
is purely atomic.


PROOF. First observe that if two vectors of $\mu$-densities $(f_1, \ldots, f_n)$ and
$(h_1, \ldots, h_n)$ satisfy (4.4) and belong to the same equivalence class, the conclusion
is immediate from (2.1) in Theorem 2.5. Hence, the proof will be complete if we
can show that whenever there exist $f_1, \ldots, f_n, h_1, \ldots, h_n$ in $\Delta$ such that
$f_i(\theta)/f_i(\eta) = h_i(\theta)/h_i(\eta)$, $i = 1, \ldots, n$, such densities exist which belong to the
same equivalence class. To do this, first set $x_i = f_i(\theta)$, $y_i = f_i(\eta)$, $x_i^* = h_i(\theta)$, and
$y_i^* = h_i(\eta)$ so that $x_i/x_i^* = y_i/y_i^* = t_i$, say, for $i = 1, \ldots, n$. Since $(\Theta, \mu)$ is
$\sigma$-finite and quaternary, there must exist a partition of $\Theta$ into four sets $A_j$ with
$0 < \mu(A_j) < \infty$, for $j = 1, 2, 3$, $\mu(A_4) > 0$, and such that $x_i\mu(A_1) + y_i\mu(A_2) < 1$,
and $x_i^*\mu(A_1) + y_i^*\mu(A_2) < 1$, for all $i = 1, \ldots, n$. In particular, we can take
$A_1 = \{\theta\}$ and $A_2 = \{\eta\}$. We will construct densities $f_1, \ldots, f_n$ and a likelihood $\ell$
for $(f_1, \ldots, f_n)$ such that $f_i(\theta) = x_i$, $f_i(\eta) = y_i$, and $h_i(\theta) = x_i^*$, $h_i(\eta) = y_i^*$,
where $h_i = \ell f_i / \int \ell f_i\, d\mu$, $i = 1, \ldots, n$.

Denote $\gamma_i = x_i\mu(A_1) + y_i\mu(A_2)$, and note that $0 < \gamma_i < \min\{t_i, 1\}$ for each $i$.
This is true because

$x_i\mu(A_1) + y_i\mu(A_2) = t_i\,[x_i^*\mu(A_1) + y_i^*\mu(A_2)] < t_i$

for all $i = 1, \ldots, n$. Choose $0 < \lambda, \xi < \infty$ such that

$\lambda < \min_i \{[1 - \gamma_i]^{-1}(t_i - \gamma_i)\} \le \max_i \{[1 - \gamma_i]^{-1}(t_i - \gamma_i)\} < \xi.$

Fixing $h \in \Delta$ an arbitrary density, now define

(4.5)  $f_i = x_i I(A_1) + y_i I(A_2) + \dfrac{(t_i - \gamma_i) - \lambda(1 - \gamma_i)}{\mu(A_3)(\xi - \lambda)}\, I(A_3) + \dfrac{\xi(1 - \gamma_i) - (t_i - \gamma_i)}{R(\xi - \lambda)}\, h\, I(A_4),$

where $R = \int I(A_4)\, h\, d\mu > 0$ and, in general, $I(A)$ denotes the indicator of the
set $A$. Clearly, $\int f_i\, d\mu = 1$ and $f_i(\theta) = x_i$, $f_i(\eta) = y_i$, $i = 1, \ldots, n$. To generate
the $h_i$'s, consider the likelihood

(4.6)  $\ell = I(A_1) + I(A_2) + \xi I(A_3) + \lambda I(A_4).$

It is easy to see that $\int \ell f_i\, d\mu = t_i$ so that $h_i(\theta) = x_i^*$ and $h_i(\eta) = y_i^*$, $i = 1, \ldots, n$. □
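The construction in this proof can be verified on a toy example. The sketch below is only an illustration under stated assumptions: a hypothetical four-point space with counting measure, so that each set A_j is a single point of measure one, and the arbitrary density h taken to be uniform (so R = 1/4). The density puts mass [(t - γ) - λ(1 - γ)]/(ξ - λ) on A_3 and [ξ(1 - γ) - (t - γ)]/(ξ - λ) on A_4, and the likelihood takes the values 1, 1, ξ, λ on A_1, ..., A_4; the code checks that the density integrates to one and that its integral against the likelihood equals t. All names and numbers are illustrative.

```python
from math import isclose


def construct(x, y, t, lam, xi):
    """Return (f, ell) on the four-point space {A_1, A_2, A_3, A_4}."""
    gamma = x + y                      # x * mu(A_1) + y * mu(A_2) with mu(A_j) = 1
    assert 0.0 < gamma < min(t, 1.0)
    ratio = (t - gamma) / (1.0 - gamma)
    assert 0.0 < lam < ratio < xi      # admissible choice of lambda and xi
    h4 = 0.25                          # value of the uniform density h on A_4
    big_r = 0.25                       # R = integral of h over A_4
    f3 = ((t - gamma) - lam * (1.0 - gamma)) / (1.0 * (xi - lam))
    f4 = (xi * (1.0 - gamma) - (t - gamma)) / (big_r * (xi - lam)) * h4
    f = [x, y, f3, f4]
    ell = [1.0, 1.0, xi, lam]
    return f, ell


x, y, t = 0.2, 0.3, 0.8
f, ell = construct(x, y, t, lam=0.5, xi=1.0)
norm = sum(l * v for l, v in zip(ell, f))      # integral of ell * f
h = [l * v / norm for l, v in zip(ell, f)]     # updated density
print(f, norm, h)
```

Here the updated density takes the values x/t and y/t at the first two points, which is exactly what the proof requires of the h_i.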


Lemma 4.1 shows that if $T: \Delta^n \to \Delta$ is externally Bayesian and satisfies (4.1)
for some function $G$ on $\Theta \times (0, \infty)^n$, then for all pairs of atoms $(\theta, \eta)$ in $\Theta^2$, there
must exist a Lebesgue measurable function $Q(\theta, \eta): (0, \infty)^n \to (0, \infty)$ such that


(4.7)  $\dfrac{T(f_1, \ldots, f_n)(\theta)}{T(f_1, \ldots, f_n)(\eta)} = Q(\theta, \eta)\!\left( \dfrac{f_1(\theta)}{f_1(\eta)}, \ldots, \dfrac{f_n(\theta)}{f_n(\eta)} \right).$


We will now derive a more specific form for the right-hand side of (4.7).





LEMMA 4.2. In addition to the conditions of Lemma 4.1, suppose that $(\Theta, \mu)$
contains at least three atoms. Then, there exist constants $v_1, \ldots, v_n$, such that
for all $x_1, \ldots, x_n > 0$ and every pair of atoms $(\theta, \eta)$ in $\Theta^2$, we have

(4.8)  $Q(\theta, \eta)(x_1, \ldots, x_n) = Q(\theta, \eta)(\mathbf{1}) \prod_{i=1}^{n} x_i^{v_i},$

where $\mathbf{1}$ denotes the $n$-dimensional vector $(1, \ldots, 1)$.


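PROOF. In the same manner as Genest, Weerahandi and Zidek (1984), define
new functions $NQ(\theta, \eta): (0, \infty)^n \to (0, \infty)$ by

$NQ(\theta, \eta)(x_1, \ldots, x_n) = \dfrac{Q(\theta, \eta)(x_1, \ldots, x_n)}{Q(\theta, \eta)(1, \ldots, 1)}$

for all atoms $\theta \ne \eta$. Let $\theta$, $\eta$, and $\zeta$ be three distinct atoms in $\Theta$ and pick $\varepsilon > 0$
small enough that there exist densities in $\Delta$ which assume any of the values $\varepsilon$,
$\varepsilon x_i$, or $\varepsilon/y_i$ at any of these three atoms. Writing $\boldsymbol{\varepsilon} = (\varepsilon, \ldots, \varepsilon)$, $x = (x_1, \ldots, x_n)$,
and $y = (y_1, \ldots, y_n)$ and assuming that all operations on vectors are performed
componentwise, we have

$NQ(\theta, \eta)(xy) = \dfrac{G(\theta, \boldsymbol{\varepsilon}x)/G(\eta, \boldsymbol{\varepsilon}/y)}{G(\theta, \boldsymbol{\varepsilon})/G(\eta, \boldsymbol{\varepsilon})} = \dfrac{G(\theta, \boldsymbol{\varepsilon}x)/G(\zeta, \boldsymbol{\varepsilon})}{G(\theta, \boldsymbol{\varepsilon})/G(\zeta, \boldsymbol{\varepsilon})} \cdot \dfrac{G(\zeta, \boldsymbol{\varepsilon})/G(\eta, \boldsymbol{\varepsilon}/y)}{G(\zeta, \boldsymbol{\varepsilon})/G(\eta, \boldsymbol{\varepsilon})} = NQ(\theta, \zeta)(x)\, NQ(\zeta, \eta)(y)$

for all $x$ and $y$ in $(0, \infty)^n$. The argument now proceeds exactly as that beginning
at (2.1) of Genest, Weerahandi, and Zidek (1984), except that in our case, the
nonmeasurable solutions are automatically eliminated because $G$, $Q$, and hence
$NQ$ were assumed to be Lebesgue measurable. It follows that (4.8) holds for all
$x_1, \ldots, x_n > 0$ and all pairs of atoms $\theta$ and $\eta$. □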


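To complete the proof of (4.2) for atoms, fix $\zeta$ an atom in $\Theta$ and choose $\varepsilon$
strictly between 0 and $1/\mu(\zeta)$. For all atoms $\theta \in \Theta$, now define $g(\theta) =
Q(\theta, \zeta)(\mathbf{1})\, G(\zeta, \varepsilon, \ldots, \varepsilon)\, \varepsilon^{-v}$, where $v = \sum_{i=1}^{n} v_i$. Then for all atoms $\theta$, we find

$G(\theta, x_1, \ldots, x_n) = g(\theta) \prod_{i=1}^{n} x_i^{v_i}$

for all $0 < x_1, \ldots, x_n < 1/\mu(\theta)$, which implies that

(4.9)  $T(f_1, \ldots, f_n)(\theta) = g(\theta) \prod_{i=1}^{n} [f_i(\theta)]^{v_i} \Big/ \int g \prod_{i=1}^{n} f_i^{v_i}\, d\mu$

for all atoms $\theta$.

Next, we derive a formula similar to (4.9) for those $\theta$ which are not atoms.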

LEMMA 4.3. Let $T: \Delta^n \to \Delta$ be externally Bayesian and assume that (4.1)
holds for some $\mu \times$ Lebesgue measurable function $G: \Theta \times (0, \infty)^n \to (0, \infty)$.
Assume also that $(\Theta, \mu)$ is not purely atomic. Let $N$ be the complement of the set
of atoms. Then

(4.10)  $T(f_1, \ldots, f_n)(\theta) = g(\theta) \prod_{i=1}^{n} [f_i(\theta)]^{w_i} \Big/ \int G(\cdot, f_1, \ldots, f_n)\, d\mu, \quad \mu\text{-a.e. on } N,$

for some nonnegative weights $w_1, \ldots, w_n \in \mathbb{R}$ adding up to unity.


PROOF. Define a new function $NG: \Theta \times (0, \infty)^n \to (0, \infty)$ by

$NG(\theta, z_1, \ldots, z_n) = G(\theta, z_1, \ldots, z_n)/G(\theta, 1, \ldots, 1)$

for all $z_1, \ldots, z_n \in (0, \infty)$. It will be enough to show that $NG(\theta, z_1, \ldots, z_n)$ is a
function of the $z_i$'s only, say $NG(z_1, \ldots, z_n)$. For, once this is done, we can define
a new pooling operator $T^*: \Delta^n \to \Delta$ by

$T^*(f_1, \ldots, f_n)(\theta) = NG(f_1(\theta), \ldots, f_n(\theta)) \Big/ \int NG(f_1, \ldots, f_n)\, d\mu.$

It is easy to see that $T^*$ is externally Bayesian and of the form (1.6) with $G$
replaced by $NG$. We can then apply Theorem 2.1 to conclude that

$NG(z_1, \ldots, z_n) = \prod_{i=1}^{n} z_i^{w_i}$

for some nonnegative constants $w_1, \ldots, w_n$ adding up to one. Letting $g(\theta) =
G(\theta, 1, \ldots, 1)$ for all $\theta$ in $N$, we arrive at (4.10).

To see that $NG(\theta, z_1, \ldots, z_n)$ is a function of the $z_i$'s alone, we proceed along
the same lines as in the proof of Lemma 4.1. Given $z_1, \ldots, z_n > 0$, choose $\varepsilon > 0$
such that $\varepsilon < \min_{i=1,\ldots,n}\{1/2, 1/(2z_i)\}$ and let $A_j$, $1 \le j \le 4$, be a partition of $N$
such that $0 < \mu(A_j) < \varepsilon$ for $j = 1, 2$; $0 < \mu(A_3) < \infty$ and $\mu(A_4) > 0$. That such
a partition of $N$ exists follows from the fact that $(\Theta, \mu)$ is $\sigma$-finite and nonatomic
on $N$.

Next, define $f_i$ as in (4.5) using $x_i = y_i = 1$ and $t_i = 1/z_i$, $i = 1, \ldots, n$. If the
likelihood for $(f_1, \ldots, f_n)$ is defined by (4.6), we have $\int \ell f_i\, d\mu = 1/z_i$, $i = 1, \ldots, n$.
Letting $h_i = \ell f_i / \int \ell f_i\, d\mu$, we have that $f_i(\theta) = 1$ and $h_i(\theta) = z_i$ for $\theta \in A_1 \cup A_2$ and
$i = 1, \ldots, n$. Now, since $T$ is externally Bayesian, we know that

$\dfrac{T(h_1, \ldots, h_n)(\theta)}{\ell(\theta)\, T(f_1, \ldots, f_n)(\theta)}$

is constant $\mu$-almost everywhere. Also, since $\ell(\theta) = 1$ on $A_1 \cup A_2$, and since (4.1)
holds $\mu$-almost everywhere on $N$, it follows that

$NG(\theta, z_1, \ldots, z_n) \propto \dfrac{T(h_1, \ldots, h_n)(\theta)}{\ell(\theta)\, T(f_1, \ldots, f_n)(\theta)}$

on $A_1 \cup A_2$. Hence $NG(\theta, z_1, \ldots, z_n)$ is essentially constant as a function of $\theta$ on
$A_1 \cup A_2$. Denote the constant value $NG(z_1, \ldots, z_n)$. To see that $NG$ is essentially
constant on $N$ as a function of $\theta$, assume to the contrary that there
exists a subset $B$ of $N$ with $\mu(B) > 0$ and such that $NG(\theta, z_1, \ldots, z_n) >
(<)\ NG(z_1, \ldots, z_n)$ for almost all $\theta$ in $B$. Since $\mu$ is nonatomic on $N$, choose a
subset of $B$ with positive measure at most $\varepsilon$. Let this new set be $A_2$, and repeat
the above construction, keeping $A_1$ the same as before. The conclusion still holds
that $NG(\theta, z_1, \ldots, z_n)$ is essentially constant on $A_1 \cup A_2$ in contradiction to the
assumption that $NG(\theta, z_1, \ldots, z_n) > (<)\ NG(z_1, \ldots, z_n)$ on $A_2$. □


We are now in a position to prove the main result of this section. 


THEOREM 4.4. Let $(\Theta, \mu)$ be a quaternary measure space and let $T: \Delta^n \to \Delta$
be an externally Bayesian pooling operator. If there exists a $\mu \times$ Lebesgue
measurable function $G: \Theta \times (0, \infty)^n \to (0, \infty)$ such that (4.1) holds for all
vectors of opinions $(f_1, \ldots, f_n)$ in $\Delta^n$, then $T$ is of the form (1.7), i.e.,


(4.11)  $T(f_1, \ldots, f_n) = g \prod_{i=1}^{n} f_i^{w_i} \Big/ \int g \prod_{i=1}^{n} f_i^{w_i}\, d\mu, \quad \mu\text{-a.e.},$


for some essentially bounded function $g: \Theta \to (0, \infty)$ and some constants
$w_1, \ldots, w_n \in \mathbb{R}$ such that $\sum_{i=1}^{n} w_i = 1$. Moreover, the weights $w_i$ are nonnegative
unless $\Theta$ is finite or there does not exist a countably infinite partition of $(\Theta, \mu)$
into nonnegligible sets.


PROOF. If $(\Theta, \mu)$ does not contain any atoms, (4.11) is immediate from
Lemma 4.3. If $(\Theta, \mu)$ is purely atomic, then (4.11) derives easily from (4.9) with
$w_i = v_i$, $i = 1, \ldots, n$. It is straightforward to see that $\sum_{i=1}^{n} w_i = 1$ from the fact
that $T$ is externally Bayesian.

More difficult is the case in which $\mu$ has atoms but is not purely atomic. In
this case, we can use Lemma 4.3 to obtain the result on the set $N$, the
complement of the set of atoms of $\mu$. Label the atoms $\theta_1, \theta_2, \ldots$ and let
$G_j(x_1, \ldots, x_n)$ denote $G(\theta_j, x_1, \ldots, x_n)$ for all $x_1, \ldots, x_n$ strictly between 0 and
$1/\mu(\theta_j)$. From the definition of externally Bayesian, we have

(4.12)  $\dfrac{\ell(\theta)\, G(\theta, f_1(\theta), \ldots, f_n(\theta))}{G(\theta, h_1(\theta), \ldots, h_n(\theta))} = \text{constant a.e. } \mu,$

whenever $h_i$ is proportional to $\ell f_i$ for all $i$. From (4.10), we have that on $N$, the
left-hand side of (4.12) equals $\prod_{i=1}^{n} t_i^{w_i}$, where $t_i$ is the integral of $\ell f_i$ for each $i$.

Now fix $t_1, \ldots, t_n$ and pick a single atom $\theta_j$. Let $\varepsilon$ be small enough so that $\varepsilon/t_i$
is strictly between 0 and $1/\mu(\theta_j)$ for each $i$. Let $A_1 = \{\theta_j\}$ and construct the
same densities as in the proof of Lemma 4.1, setting $x_i = \varepsilon$ and $x_i^* = \varepsilon/t_i$ for
each $i$. (Here $A_2$, $A_3$, $A_4$, and the $y_i$'s and $y_i^*$'s are arbitrary sets and values
satisfying the restrictions described in the proof of Lemma 4.1.) Also construct
the same likelihood $\ell$ as in that proof. Using the fact that (4.12) holds on all of
$\Theta$, we have

(4.13)  $\dfrac{G_j(\varepsilon, \ldots, \varepsilon)}{G_j(\varepsilon/t_1, \ldots, \varepsilon/t_n)} = \prod_{i=1}^{n} t_i^{w_i}.$

It follows directly from (4.13) that for all $z_1, \ldots, z_n$ between 0 and $1/\mu(\theta_j)$,
$G_j(z_1, \ldots, z_n) = G_j(\varepsilon, \ldots, \varepsilon)\, \varepsilon^{-1} \prod_{i=1}^{n} z_i^{w_i}$. By letting $g(\theta_j) = G_j(\varepsilon, \ldots, \varepsilon)\, \varepsilon^{-1}$, we
have proven (4.11).

Finally, note that the weights are automatically nonnegative provided that $\mu$
is not purely atomic. If $(\Theta, \mu)$ is purely atomic but includes a countably infinite
number of atoms, it is fairly easy to construct densities $f_1, \ldots, f_n$ which will
make the integral

(4.14)  $\int g \prod_{i=1}^{n} f_i^{w_i}\, d\mu$

infinite and lead to a contradiction unless all the weights are nonnegative. Of
course, (4.14) is always finite when $\Theta$ is finite and $\mu$ is some sort of counting
measure, so negative weights cannot be ruled out in this case. Similarly, $g$ must
be essentially bounded, or else there exists $f$ such that (4.14) is infinite when all
of the $f_i$ are equal to $f$ (cf. Theorem 20.15 of Hewitt and Stromberg, 1965). □


Unless the group of experts is reporting its opinions to an outside decision 
maker who chooses to aggregate them using (4.9), it is difficult to interpret g, let 
alone offer advice as to how it could be selected. To some extent, the same 
applies to the interpretation and determination of the weights, although some 
heuristics are available (for example, see Winkler, 1968 or Genest and Schervish, 
1985). In fact, even if the pooling operator (4.9) is adopted by a decision maker, it 
is not so clear what g and the weights stand for. In particular, we should guard
against concluding too hastily that the function g represents the decision maker’s
“prior.” After all, there is no reason why a prior density should necessarily be 
bounded, nor is every bounded function a possible prior density. 

One way around the choice of $g$ in (4.9) would be to insist that the pooling
operator $T$ preserves unanimity. In general, an aggregation procedure $T: \Delta^n \to \Delta$
preserves unanimity if and only if $T(f, \ldots, f) = f$ for all $f \in \Delta$. As the following
corollary indicates, this is enough to reduce (4.9) to an ordinary logarithmic
opinion pool.


COROLLARY 4.5. Let $(\Theta, \mu)$ be a quaternary measure space and let $T:
\Delta^n \to \Delta$ be an externally Bayesian pooling operator which preserves unanimity.
If there exists a $\mu \times$ Lebesgue measurable function $G: \Theta \times (0, \infty)^n \to (0, \infty)$
such that (4.1) holds for all vectors of opinions $(f_1, \ldots, f_n)$ in $\Delta^n$, then $T$ is a
logarithmic opinion pool, i.e.,


$T(f_1, \ldots, f_n) = \prod_{i=1}^{n} f_i^{w_i} \Big/ \int \prod_{i=1}^{n} f_i^{w_i}\, d\mu, \quad \mu\text{-a.e.},$


for some arbitrary reals $w_1, \ldots, w_n$ adding up to 1. Moreover, the weights $w_i$ are
nonnegative unless $\Theta$ is finite or there does not exist a countable partition of
$(\Theta, \mu)$ into nonnegligible sets.
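The two properties that single out the ordinary logarithmic pool can be illustrated on a finite space. The following sketch assumes a hypothetical three-point space with counting measure, so that densities are simply probability vectors; the function names and numbers are illustrative and are not taken from the paper. It checks that the pool preserves unanimity and commutes with Bayesian updating by a common likelihood.

```python
from math import isclose, prod


def normalise(values):
    """Rescale positive values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def log_pool(fs, ws):
    """Pooled density proportional to prod_i f_i ** w_i at each point."""
    points = range(len(fs[0]))
    return normalise([prod(f[k] ** w for f, w in zip(fs, ws)) for k in points])


def update(f, ell):
    """Posterior density proportional to ell * f."""
    return normalise([l * v for l, v in zip(ell, f)])


fs = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3], [0.25, 0.25, 0.5]]
ws = [0.5, 0.3, 0.2]
ell = [2.0, 0.5, 1.0]

pool_then_update = update(log_pool(fs, ws), ell)
update_then_pool = log_pool([update(f, ell) for f in fs], ws)
unanimous = log_pool([fs[0], fs[0], fs[0]], ws)
print(pool_then_update, update_then_pool, unanimous)
```

Unanimity holds here because the weights sum to one; multiplying the pooled density by a non-constant function g before normalising would break it.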


5. Discussion. The purpose of this paper has been two-fold. On the one 
hand we have provided a characterization of all externally Bayesian pooling 
operators in Theorem 2.5. The form of these operators is quite general, as it was 
derived under the sole assumption that all densities are strictly positive almost 
everywhere. On the other hand, when the space $\Theta$ is assumed to be quaternary
and the operator is required to satisfy a “locality” condition (4.1), we show that
the operator must be logarithmic in the sense of (1.7). This latter result does not
apply to the cases in which $\Theta$ consists of merely two or three atoms.

If the space $\Theta$ contains only two points, then it is trivial to see that every
pooling operator satisfies (4.1). In this case, one can easily construct externally
Bayesian operators which are not logarithmic. One such example is constructed
from (3.2) as follows. Let $b_\alpha(0) = 1$ and let $b_\alpha(1) = \max\{1, \alpha_2, \ldots, \alpha_n\}$. It is easy
to see that $T(p)$ equals $[\max_i\{p_i\}, 1 - \max_i\{p_i\}]$, which is externally Bayesian,
satisfies (4.1), and is clearly not logarithmic. If $(\Theta, \mu)$ consists of only three
atoms, it is not known whether logarithmic opinion pools are the only externally
Bayesian operators which satisfy (4.1). Theorem 2.5 still holds in this case,
however. The case of one atom is left to the reader.
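The two-point counterexample can be reproduced directly from the odds representation. In the sketch below (names and numbers are hypothetical), a pooling rule is specified by a function of the class label, giving the value of b at 1 with b at 0 fixed to one, and the pooled odds are the odds of expert 1 multiplied by that value. Choosing the maximum of 1 and the class coordinates returns the largest assessed probability, and the rule still commutes with updating by a common likelihood.

```python
from math import isclose


def odds(p):
    return p / (1.0 - p)


def bayes_update(p, lik1, lik0):
    """Posterior probability of E from prior p and likelihood values l(1), l(0)."""
    return lik1 * p / (lik1 * p + lik0 * (1.0 - p))


def pool_from_b(ps, b):
    """Pooled probability of E with pooled odds = odds_1 * b(alpha)."""
    first = odds(ps[0])
    alpha = [odds(p) / first for p in ps[1:]]
    pooled_odds = first * b(alpha)
    return pooled_odds / (1.0 + pooled_odds)


def b_max(alpha):
    """The choice b(1) = max{1, alpha_2, ..., alpha_n}."""
    return max([1.0] + alpha)


ps = [0.3, 0.6, 0.45]
pooled = pool_from_b(ps, b_max)
pooled_after = pool_from_b([bayes_update(p, 3.0, 0.5) for p in ps], b_max)
print(pooled, pooled_after, bayes_update(pooled, 3.0, 0.5))
```

The check works because Bayesian updating multiplies every expert's odds by the same likelihood ratio, which leaves the class label unchanged and preserves the ordering of the probabilities.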

The theorems of this paper shed some light on the mechanics of externally 
Bayesian behavior. If, however, a decision maker wishes to treat the opinions of 
a group of experts as data, little is known about the implications of the 
externally Bayesian criterion for the modeling process. In the Bayesian model 
proposed by Lindley (1985), conditions are given under which the decision 
maker’s posterior distribution would be externally Bayesian, but his conditions 
are not easily interpretable. 


Acknowledgments. Thanks are due to M. H. DeGroot (Carnegie-Mellon 
University), C. Sundberg (University of Tennessee), S. Dharmadhikari (Southern 
Illinois University, Carbondale), and J. V. Zidek (University of British Columbia) 
for fruitful discussions in the course of writing this article. The first author’s 
investigation was supported in part by an N.S.E.R.C. postdoctoral fellowship 
which was held at Carnegie-Mellon University. Part of the second author’s work 
was performed at the University of Washington, which generously provided 
research facilities. 


REFERENCES 


BACHARACH, MICHAEL (1972). Scientific disagreement. Unpublished manuscript. 

DE FINETTI, BRUNO (1954). Media di decisioni e media di opinioni. Bull. Inst. Internat. Statist. 34
144-157.

FRENCH, SIMON (1980). Updating of belief in the light of someone else's opinion. J. Roy. Statist.
Soc. Ser. A 143 43-48.


FRENCH, SIMON (1981). Consensus of opinion. European J. Oper. Res. 7 332-340. 

FRENCH, SIMON (1985). Group consensus probability distributions: A critical survey. In Bayesian 
Statistics 2 (J. M. Bernardo, et al., eds.) 183-201. North-Holland, Amsterdam. 

GENEST, CHRISTIAN (1984a). A conflict between two axioms for combining subjective distributions.
J. Roy. Statist. Soc. Ser. B 46 403-405.

GENEST, CHRISTIAN (1984b). A characterization theorem for externally Bayesian groups. Ann.
Statist. 12 1100-1106.

GENEST, CHRISTIAN (1984c). Pooling operators with the marginalization property. Canad. J. Statist. 
12 153-168. 

GENEST, CHRISTIAN and SCHERVISH, MARK J. (1985). Modeling expert judgments for Bayesian
updating. Ann. Statist. 13 1198-1212.

GENEST, CHRISTIAN and ZIDEK, JAMES V. (1986). Combining probability distributions: A critique
and an annotated bibliography (with discussion). Statist. Sci. 1 114-148.

GENEST, C., WEERAHANDI, S. and ZIDEK, J. V. (1984). Aggregating opinions through logarithmic
pooling. Theory and Decision 17 61-70.

HALMOS, PAUL R. (1950). Measure Theory. Van Nostrand, New York.

HEWITT, EDWIN and STROMBERG, KARL (1965). Real and Abstract Analysis. Springer, New York.

LADDAGA, ROBERT (1977). Lehrer and the consensus proposal. Synthese 36 473-477.

LINDLEY, DENNIS V. (1985). Reconciliation of discrete probability distributions. In Bayesian
Statistics 2 (J. M. Bernardo, et al., eds.) 375-390. North-Holland, Amsterdam.

MADANSKY, ALBERT (1964). Externally Bayesian groups. Technical Report RM-4141-PR, RAND
Corporation.

MADANSKY, ALBERT (1978). Externally Bayesian groups. Unpublished manuscript, University of 
Chicago. 

McConway, KEVIN J. (1978). The combination of experts’ opinions in probability assessment: Some 
theoretical considerations. Ph.D. thesis, University College London. 

McConway, KEVIN J. (1981). Marginalization and linear opinion pools. J. Amer. Statist. Assoc. 76 
410-414. 

MORRIS, PETER A. (1974). Decision analysis expert use. Management Sci. 20 1233-1241.

MORRIS, PETER A. (1977). Combining expert judgments: a Bayesian approach. Management Sci. 23
679-698.

RAIFFA, HOWARD (1968). Decision Analysis: Introductory Lectures on Choices under Uncertainty.
Addison-Wesley, Reading, Mass.

WAGNER, CARL G. (1982). Allocation, Lehrer models, and the consensus of probabilities. Theory and 
Decision 14 207-220. 

WAGNER, CARL G. (1984). Aggregating subjective probabilities: Some limitative theorems. Notre
Dame J. Formal Logic 25 233-240.

WALD, ABRAHAM (1939). Contributions to the theory of statistical estimation and testing
hypotheses. Ann. Math. Statist. 10 299-326.

WEERAHANDI, SAMARADASA and ZIDEK, JAMES V. (1981). Multi-Bayesian statistical decision theory.
J. Roy. Statist. Soc. Ser. A 144 85-93.

WINKLER, ROBERT L. (1968). The consensus of subjective probability distributions. Management
Sci. 15 B61-B76.

WINKLER, ROBERT L. (1981). Combining probability distributions from dependent information
sources. Management Sci. 27 479-488.


CHRISTIAN GENEST
DEPARTMENT OF STATISTICS
AND ACTUARIAL SCIENCE
UNIVERSITY OF WATERLOO
WATERLOO, ONTARIO N2L 3G1
CANADA

KEVIN J. MCCONWAY
DEPARTMENT OF STATISTICS
THE OPEN UNIVERSITY
MILTON KEYNES MK7 6AA
ENGLAND

MARK J. SCHERVISH
DEPARTMENT OF STATISTICS
CARNEGIE-MELLON UNIVERSITY
PITTSBURGH, PENNSYLVANIA 15213


The Annals of Statistics
1986, Vol. 14, No. 2, 502-516


A BAYES PROCEDURE FOR THE IDENTIFICATION 
OF UNIVARIATE TIME SERIES MODELS 


By D. S. POSKITT


Australian National University 


This paper is concerned with model selection in time series analysis. An
identification criterion is presented that is asymptotically equivalent to a
Bayes decision rule. The discussion is conducted in the context of a general 
class of parametric time series models and consideration is given to the 
special case of order determination in autoregressive moving-average repre- 
sentations. Consistency of the criterion is proved. 


1. Introduction. Recently, attention has been directed in the time series 
literature to the problems associated with choosing a finite parameter model for 
an observed process. Much of the discussion has been concerned with the 
development of model selection criteria and although alternative principles and 
methods have been employed in the derivations the criteria are usually expressed 
as functions of the estimated innovation variance. See, for example, Akaike 
(1969), Parzen (1974), Hannan and Quinn (1979), and Shibata (1980), where the 
argument is conducted in the context of autoregressive processes, and Hannan 
(1980), Taniguchi (1980), and Hannan and Rissanen (1982), where the analysis is 
extended to autoregressive moving-average models. Alternatively Hosking (1980), 
Poskitt and Tremayne (1981), and Pötscher (1983), amongst others, have investigated
conventional hypothesis testing procedures designed to test the adequacy
of a chosen fitted model. The diagnostic checking devices considered may be 
represented as functions of the residual autocorrelations and are related to the 
portmanteau statistic discussed by Box and Pierce (1970). In the present paper 
results from decision theory are used to determine a Bayes decision rule for time 
series model identification. A tenuous link between the two procedures previ- 
ously mentioned is thereby obtained as the selection criterion so derived can be 
expressed in terms of both the one-step ahead prediction error variance and the 
autocorrelations of the residual process. For an alternative approach to modify- 
ing existing model selection criteria see Rissanen (1983). 

Autoregressive moving-average, ARMA($p$, $q$), representations of the form


(1.1)  $x(t) + \sum_{i=1}^{p} \alpha_i x(t - i) = \varepsilon(t) + \sum_{i=1}^{q} \mu_i \varepsilon(t - i), \quad t = 0, \pm 1, \ldots,$

where $\{\varepsilon(t)\}$ is a white noise process, provide a general and widely applied class
of models for stationary time series but the availability of finite parameter
models where the power spectrum is not rational, as in Bloomfield (1973), leads
to a consideration of a more general problem.
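For orientation, the rational power spectrum implied by an ARMA representation can be evaluated as follows. This is a sketch under stated assumptions rather than anything given in the paper: it uses the standard formula for the spectral density of a stationary ARMA process with innovation variance `sigma2`, follows the sign convention of (1.1) with the autoregressive coefficients on the left-hand side, and the function and argument names are hypothetical.

```python
import cmath
from math import isclose, pi


def arma_spectrum(w, ar, ma, sigma2=1.0):
    """Spectral density at frequency w of x(t) + sum ar_i x(t-i) = e(t) + sum ma_i e(t-i)."""
    z = cmath.exp(-1j * w)
    num = abs(1.0 + sum(m * z ** (k + 1) for k, m in enumerate(ma))) ** 2
    den = abs(1.0 + sum(a * z ** (k + 1) for k, a in enumerate(ar))) ** 2
    return sigma2 * num / (2.0 * pi * den)


# White noise has a flat spectrum; an autoregression concentrates power.
print(arma_spectrum(0.3, [], []))
print(arma_spectrum(0.0, [-0.5], []))
print(arma_spectrum(pi, [-0.5], [0.4]))
```

A spectrum that is not a ratio of trigonometric polynomials, such as the exponential family mentioned in the text, would simply replace this function while the rest of the analysis stays the same.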


Received September 1983; revised September 1985. 

AMS 1980 subject classifications. Primary 62M10; secondary 62C10.

Key words and phrases. Time series model, power spectrum, autoregressive moving-average 
representation, Bayes decision rule, model selection criterion, consistency. 




BAYES IDENTIFICATION OF TIME SERIES 503 


ASSUMPTION P. Let $\{X(t)\}$ be a discrete time nondeterministic zero mean
Gaussian stochastic process with power spectrum $f(\omega) \in L^2$, the class of functions
square integrable with respect to Lebesgue measure $\nu$ on $[-\pi, \pi]$.


Suppose that a model for an observed process assumed to satisfy Assumption
P is characterised by a particular functional form for the power spectrum and is
specified by a parametric family $M = \{g(\theta, \omega) \in L^2, \theta \in \Theta\}$ satisfying the following
assumptions.


ASSUMPTION M1. The parameter space $\Theta$ is a nonempty open subset of $\mathbb{R}^d$
where the positive integer $d$ is referred to as the model dimensionality. The
closure of $\Theta$, $\bar{\Theta}$, is convex and bounded.


ASSUMPTION M2. The function $g$ is continuous on $\Theta \times [-\pi, \pi]$, $g > 0$, and
if $\theta_1 \ne \theta_2$ then $g(\theta_1, \omega) \ne g(\theta_2, \omega)$ on a set of positive Lebesgue measure.


ASSUMPTION M3. The partial derivatives $\partial g(\theta, \omega)/\partial\theta_i$, $\partial^2 g(\theta, \omega)/\partial\theta_i\,\partial\theta_j$
and $\partial^3 g(\theta, \omega)/\partial\theta_i\,\partial\theta_j\,\partial\theta_k$, $i, j, k = 1, \ldots, d$, are continuous on $\Theta \times [-\pi, \pi]$.


The above model requirements will be referred to as Assumptions M and, 
together with Assumption P on the process, will be maintained throughout the 
paper. 

In order to quantify the adequacy of a model a suitable measure of the
consequences of employing a particular parametric specification is required. In
the following section of the paper a frequency domain utility function that
provides a measure appropriate for the observational decision problem of
discriminating among a given set of alternative models on the basis of a finite
realisation is discussed. The likelihood function and certain asymptotic proper- 
ties of the likelihood and the Gaussian estimator of theta are also considered. In 
Section 3 the principle of precise measurement, Savage (1962), is pursued. 
Starting from a position of prior ignorance a Bayes decision rule, which for a 
given realisation maximises the average utility with respect to the posterior 
distribution of the model and its parameters, is derived and the results are 
specialised to the ARMA case. Prior ignorance, that is, the notion that little is 
known a priori relative to the information provided by the data, is represented 
using an invariant prior distribution as suggested by Jeffreys (1961). Such a 
choice is noninformative but does not result in an improper prior distribution 
here due, essentially, to Assumptions M. To this extent the present formulation 
may be thought unconventional, although it is related to the standard Bayesian 
procedure for model discrimination, Pericchi (1984). This procedure has been 
criticised in the literature as in some situations it is asymptotically inconsistent, 
Pericchi op. cit., and therefore the large sample behaviour of the posterior 
expected utility is investigated further in Section 4. Consistency of the model 
selection criterion is established and the implications for the practical imple- 
mentation of the identification procedure discussed. The proofs of the main 
lemmas contained within the paper are assembled in the final section. 


504 D. S. POSKITT 


2. Model utility and likelihood. In the present problem an action involves
choosing a model and selecting from within the parametric family a particular
member. The action and state spaces coincide with $L^2$. Given a set of $m$ models,
$M_i = \{g_i(\theta_i, \omega) \in L^2, \theta_i \in \Theta_i\}$, $i = 1, \ldots, m$, what constitutes a best action depends
upon the extent of available knowledge concerning the true state of
nature. For any $\theta \in \Theta$ let $m(\theta)$ denote the action of choosing the particular
member $g(\theta, \omega)$ from the family $M$. For convenience, here and throughout the
paper, the model subscript $i$, $i = 1, \ldots, m$, is omitted and generic notation
employed where this raises no ambiguity. In the extreme but unrealistic situation
that the true power spectrum $f(\omega)$ is known, a natural measure of the
regret or loss involved in taking action $m(\theta)$ is given by the integrated squared
relative error

(2.1)  $\eta\{m(\theta)\} = \int_{-\pi}^{\pi} \left\{ \dfrac{f(\omega)}{g(\theta, \omega)} - 1 \right\}^2 d\omega.$

The associated utility may be taken as $U\{m(\theta)\} = \exp[-\eta\{m(\theta)\}]$. For theoretical
and practical purposes, however, it is necessary to consider a rather more
complicated specification of the utility function and in particular it is necessary
to allow it to depend on and be modified by the observations.
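The regret and its associated utility are easy to approximate when both spectra are available in closed form. The sketch below is an illustration only: it reads the integrated squared relative error as the integral over the interval from -π to π of (f/g - 1) squared, evaluates it with a midpoint rule, and uses hypothetical function names and example spectra.

```python
from math import cos, exp, isclose, pi


def integrated_sq_rel_error(f, g, n_grid=2000):
    """Midpoint-rule approximation of the integral of (f/g - 1)^2 over [-pi, pi]."""
    step = 2.0 * pi / n_grid
    total = 0.0
    for k in range(n_grid):
        w = -pi + (k + 0.5) * step
        total += (f(w) / g(w) - 1.0) ** 2
    return total * step


def utility(f, g):
    """Utility exp(-regret) of adopting the model spectrum g when f is true."""
    return exp(-integrated_sq_rel_error(f, g))


def true_spectrum(w):
    # Spectrum of a first-order autoregression with coefficient 0.5 (assumed example).
    return 1.0 / (2.0 * pi * (1.25 - cos(w)))


def white_noise(w):
    return 1.0 / (2.0 * pi)


print(utility(true_spectrum, true_spectrum))
print(utility(true_spectrum, white_noise))
```

The correct model attains the maximal utility of one, and a misspecified model is penalised according to how far the ratio of the spectra departs from one.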

Given a realisation $x_T = (x(1), \ldots, x(T))$ of $T$ observations on the process
$\{X(t)\}$ set the sample power spectrum, or periodogram,

$I_T(\omega) = (2\pi T)^{-1} |Z_T(\omega)|^2,$

where

$Z_T(\omega) = \sum_{t=1}^{T} x(t) \exp(-i\omega t).$


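Evaluating $I_T(\omega)$ at the frequencies $\omega_j = 2\pi j/N$, $-(T - 1) \le j \le (T - 1)$, $N =
2T - 1$, the loss of action $m(\theta)$ can be approximated using the numeraire


(2.2)  $\eta_T\{m(\theta)\} = \dfrac{2\pi}{N} \sum_j \left\{ \dfrac{I_T^2(\omega_j)}{2 g^2(\theta, \omega_j)} - \dfrac{2 I_T(\omega_j)}{g(\theta, \omega_j)} \right\} + 2\pi.$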
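The periodogram and the frequency grid used here can be computed directly from their definitions. The following is a minimal standard-library sketch with hypothetical function names; it evaluates the defining sum at each frequency rather than using a fast Fourier transform, which is adequate for short series.

```python
import cmath
from math import isclose, pi


def periodogram(x, w):
    """I_T(w) = |sum_{t=1}^{T} x(t) exp(-i w t)|^2 / (2 pi T)."""
    z = sum(xt * cmath.exp(-1j * w * t) for t, xt in enumerate(x, start=1))
    return abs(z) ** 2 / (2.0 * pi * len(x))


def frequencies(T):
    """The N = 2T - 1 frequencies w_j = 2 pi j / N, j = -(T-1), ..., T-1."""
    N = 2 * T - 1
    return [2.0 * pi * j / N for j in range(-(T - 1), T)]


x = [1.0, -0.5, 0.25, 2.0, -1.0]
freqs = frequencies(len(x))
pgram = [periodogram(x, w) for w in freqs]

# Averaging over this grid recovers the sample variance (about zero mean).
average_power = 2.0 * pi * sum(pgram) / len(freqs)
print(average_power, sum(v * v for v in x) / len(x))
```

The final check reflects a property of this grid: with N = 2T - 1 equally spaced frequencies, sums of the periodogram reproduce the sample autocovariances exactly, so frequency-domain averages of this kind carry the same information as the time-domain quantities.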


LEMMA 1. The numeraire $\eta_T\{m(\theta)\}$ converges to $\eta\{m(\theta)\}$ with probability
one, uniformly for all $\theta$ in $\bar{\Theta}$.


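The implication of Lemma 1 is that as the sample size increases any departure
of $\eta_T\{m(\theta)\}$ from zero reflects more a loss from using an inappropriate model
than a departure of the numeraire from the theoretical but unknown regret given
in (2.1). The utility associated with $m(\theta)$ will therefore be taken as

$U_T\{m(\theta)\} = \exp[-\eta_T\{m(\theta)\}].$

The likelihood of a model $M$ and its associated parameter vector $\theta$, denoted
pr$(x_T \mid M, \theta)$, is

(2.3)  $\exp\left[ -\tfrac{1}{2} \left\{ \ln \det \Sigma_T(\theta) + T \ln 2\pi + x_T' \Sigma_T(\theta)^{-1} x_T \right\} \right],$

where the variance-covariance matrix of the vector $x_T$ is

$\Sigma_T(\theta) = \left[ \int_{-\pi}^{\pi} g(\theta, \omega)\, e^{i\omega(r - s)}\, d\omega \right], \quad r, s = 1, \ldots, T.$

In order to re-express pr$(x_T \mid M, \theta)$ in the frequency domain consider the function

$l_T(\theta) = N^{-1} \sum_j \left\{ \ln 2\pi g(\theta, \omega_j) + \dfrac{I_T(\omega_j)}{g(\theta, \omega_j)} \right\}.$

LEMMA 2. For all $\theta$ in $\bar{\Theta}$

(i)  $\left| T^{-1} \ln \det \Sigma_T(\theta) - \ln 2\pi - N^{-1} \sum_j \ln g(\theta, \omega_j) \right| \to 0$

and

(ii)  $\left| T^{-1} x_T' \Sigma_T(\theta)^{-1} x_T - N^{-1} \sum_j I_T(\omega_j)/g(\theta, \omega_j) \right| \to 0$

almost surely (a.s.) and uniformly.


An immediate corollary of Lemma 2 is that $T^{-1} |\ln \text{pr}(x_T \mid M, \theta) + \tfrac{1}{2} T \ln 2\pi +
\tfrac{1}{2} T l_T(\theta)| \to 0$ a.s. and the asymptotic behaviour of the likelihood can be investigated
via the limiting behaviour of $l_T(\theta)$.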

A second consequence of Lemma 2, and its proof, is that /,(8) converges 
uniformly in @ and with probability one to the corroborant function 


af fw) | 
(6) = (2r) In 2ag(8,0) + ———— | do, 
(8) = (27) rl (8,0) + Ty 
a continuous function of @ on the compact set ©. Let 6, denote a value of theta 
at which the infimum of /(6) is achieved, that is /(@,) < /(8) for all 0 € ©. The 
vector 4) yields the best fitting member of the family M for the process {X(t)}. 
However, since In y + d/y, d > 0, is minimised at y = d, 


(2.4) (6) 21+ (27) f In2rf(w)dw, 


with equality if and only if g(®,w) = f(w) almost everywhere (a.e.). This 
equality cannot be assumed to hold for any specification. For this reason 6, will 
be referred to as the pseudo true parameter for the model. Now let 6, be a value 
minimising J,.(8); such a value exists because l (0) is a continuous function 
defined on a compact set. Employing the nomenclature of Whittle (1962) 6, will 
be called the Gaussian estimator. The relationship between the Gaussian estima- 
tor and the pseudo true parameter and the properties of l (8) described 
immediately below are germane to the subsequent analysis of model expected 
utility. 
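As an illustration of the frequency-domain function discussed above, the following is a minimal sketch that evaluates the periodogram at the frequencies $\omega_j = 2\pi j/N$ and averages $\ln 2\pi g + I_T/g$ over them. It assumes the periodogram scaling $I_T(\omega) = |\sum_t x_t e^{-i\omega t}|^2/(2\pi T)$ and uses a hypothetical AR(1) density `ar1_density`; neither the function names nor the simulated data come from the paper. For the white-noise member with $\sigma^2$ set to the sample second moment $c_0$, the value should equal $\ln c_0 + 1$.

```python
import cmath
import math
import random


def fourier_frequencies(T):
    """Frequencies w_j = 2*pi*j/N, -(T-1) <= j <= T-1, with N = 2T - 1."""
    N = 2 * T - 1
    return [2 * math.pi * j / N for j in range(-(T - 1), T)]


def periodogram(x, w):
    """I_T(w) = |sum_t x_t exp(-i*w*t)|^2 / (2*pi*T); this scaling is an assumption."""
    dft = sum(xt * cmath.exp(-1j * w * t) for t, xt in enumerate(x))
    return abs(dft) ** 2 / (2 * math.pi * len(x))


def l_T(x, g):
    """Average over the w_j of ln(2*pi*g(w_j)) + I_T(w_j) / g(w_j)."""
    freqs = fourier_frequencies(len(x))
    terms = [math.log(2 * math.pi * g(w)) + periodogram(x, w) / g(w) for w in freqs]
    return sum(terms) / len(freqs)


def ar1_density(sigma2, a):
    """Hypothetical AR(1) spectral density sigma2 / (2*pi*|1 - a*e^{iw}|^2)."""
    return lambda w: sigma2 / (2 * math.pi * abs(1 - a * cmath.exp(1j * w)) ** 2)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]      # illustrative data only
c0 = sum(v * v for v in x) / len(x)                  # sample second moment
l_white = l_T(x, ar1_density(c0, 0.0))               # white-noise member, sigma2 = c0
l_wrong_scale = l_T(x, ar1_density(2 * c0, 0.0))     # same member, mis-scaled
```

Minimising `l_T` over the parameters of a chosen family (for example over `sigma2` and `a` here) would give a Gaussian-type estimate in the sense used in the text; the sketch only evaluates the function.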


LEMMA 3. If $\theta_0$ is unique and lies in the interior of $\Theta$ then $\hat\theta_T$ is a strongly consistent estimator of $\theta_0$.


506 D. S. POSKITT 


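LEMMA 4. Let $H_T(\theta)$ denote the hessian matrix $\partial^2 l_T(\theta)/\partial\theta\,\partial\theta'$. Then $H_T(\theta)$ converges to

$$2I(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\frac{\partial^2 \ln g(\theta,\omega)}{\partial\theta\,\partial\theta'} + f(\omega)\,\frac{\partial^2 g(\theta,\omega)^{-1}}{\partial\theta\,\partial\theta'}\right\}d\omega$$

almost surely and uniformly in $\theta \in \Theta$.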


The following addition to Lemma 4 supplementing Assumptions M proves to 
be necessary in Section 3. 


ASSUMPTION M4. The information matrix $I(\theta_0)$ is positive definite.


By Assumption M3 $I(\theta)$ and $H_T(\theta)$ are continuous in $\theta$ and it follows that for $T$ sufficiently large $H_T(\hat\theta_T)$ will also be positive definite a.s. in a neighbourhood of $\theta_0$, because the uniform convergence of $\tfrac{1}{2}H_T(\theta)$ to $I(\theta)$ and the convergence of $\hat\theta_T$ to $\theta_0$ a.s. ensure that $\tfrac{1}{2}H_T(\hat\theta_T)$ is a strongly consistent estimator of $I(\theta_0)$.


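REMARK. If the model obtains then the pseudo true parameter point $\theta_0$ coincides with the true value of the parameter and $f(\omega) = g(\theta_0,\omega)$ a.e. In this case $I(\theta_0)$ simplifies to

$$\frac{1}{4\pi}\int_{-\pi}^{\pi}\frac{\partial \ln g(\theta,\omega)}{\partial\theta}\,\frac{\partial \ln g(\theta,\omega)}{\partial\theta'}\,d\omega, \qquad \text{evaluated at } \theta = \theta_0.$$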


This corresponds to the usual expression given for the information matrix. See 
Hannan (1973, page 137) and the references contained therein. 


3. An asymptotic Bayes decision rule. In order to proceed a specification for the prior distribution of the model and its associated parameters, $\mathrm{pr}(M, \theta)$, is required. As indicated above, the prior distribution of $\theta$ given $M$ is chosen noninformatively following Jeffreys (1961, Chapter 3) as

$$\mathrm{pr}(\theta \mid M) \propto \{\det I(\theta)\}^{1/2}.$$

As it is common practice to espouse the principle of parsimony when modelling time series, the model prior $\mathrm{pr}(M)$ is assumed to be proportional to $(2\pi)^{-d/2}$. This gives a prior odds ratio of approximately 2:5 for each unit increase in dimension (since $(2\pi)^{-1/2} \approx 0.4$) and, reinterpreting Jeffreys (1961, Appendix B), indicates an indecisive preference for every unit decrease in model dimensionality. For some discussion of alternative model priors and their interpretation see Poskitt and Tremayne (1983). By virtue of Assumptions M

$$0 < \int_{\Theta}\{\det I(\theta)\}^{1/2}\,d\theta \le \sup_{\Theta}\{\det I(\theta)\}^{1/2}\,\nu_d(\Theta) < \infty,$$

and given that $\sum_i (2\pi)^{-d_i/2} < \infty$ the prior distribution,

$$\mathrm{pr}(M, \theta) = \mathrm{pr}(M)\,\mathrm{pr}(\theta \mid M) \propto \{(2\pi)^{-d}\det I(\theta)\}^{1/2},$$

may be normalised to give a proper mixed mass-density function.




The Bayes decision rule can now be constructed using the extensive method of analysis, Raiffa and Schlaiffer (1961, Chapter 1). The action $m(\theta)$ is sought which, for a given data set, maximises the posterior expected utility

$$E[U\{m(\theta)\}] = \int U\{m(\theta)\}\,\mathrm{pr}(M,\theta)\,\mathrm{pr}(x_T \mid M,\theta)\,d\theta \,/\, \mathrm{pr}(x_T),$$

where

$$\mathrm{pr}(x_T) = \sum_i \mathrm{pr}(M_i)\int \mathrm{pr}(\theta \mid M_i)\,\mathrm{pr}(x_T \mid M_i,\theta)\,d\theta.$$

This provides a principle for determining the best action in relation to the current realisation. Let


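$$E_T[U_T\{m(\theta)\}] = K\int \exp[-\eta_T\{m(\theta)\}]\,\{(2\pi)^{-d}\det I(\theta)\}^{1/2}\exp[-\tfrac{1}{2}T\,l_T(\theta)]\,d\theta, \tag{3.1}$$

where the constant $K$ is the reciprocal of

$$\sum_{i=1}^{m}\int_{\Theta_i}\{(2\pi)^{-d_i}\det I_i(\theta)\}^{1/2}\,d\theta.$$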


Employing the approach of Lindley (1960) and letting $T \to \infty$, Lemmas 1 and 2 and the Arzela-Ascoli theorem imply that $|E_T[U_T\{m(\theta)\}] - E[U\{m(\theta)\}]| \to 0$ a.s. and the limiting behaviour of the posterior expected utility can therefore be ascertained from that of the integral in (3.1). Using Lemmas 3 and 4 it is possible to establish the next lemma involving the last two factors of the integrand.


LEMMA 5. For all values of $T$ and each $\theta \in \Theta$ set

$$\phi_T(\theta) = \{(T/2\pi)^d\det I(\theta)\}^{1/2}\exp[-\tfrac{1}{2}T\{l_T(\theta) - l_T(\hat\theta_T)\}].$$

Then $\{\phi_T(\theta)\}$ forms a sequence of regular generalised functions, translated to $\hat\theta_T$, converging to a Dirac delta function.


Lemma 5 is based on the result that asymptotically the likelihood, which is of order $T$, behaves like the kernel of a $d$ variate Gaussian density function with mean vector $\hat\theta_T$ and variance-covariance matrix $2H_T(\hat\theta_T)^{-1}/T$. Consequently, as the sample size increases $\phi_T(\theta)$, which is proportional to $\exp[\tfrac{1}{2}T\,l_T(\hat\theta_T)]$ times the posterior density, approximates an impulse function centred at $\hat\theta_T$. Therefore when the expectation

$$E_T[U_T\{m(\theta)\}] = K\exp[-\tfrac{1}{2}T\,l_T(\hat\theta_T)]\,T^{-d/2}\int \exp[-\eta_T\{m(\theta)\}]\,\phi_T(\theta)\,d\theta$$

is evaluated any values outside of an arbitrarily small neighbourhood of $\hat\theta_T$ make a negligible contribution to the integral. Taking logarithms and neglecting common factors then gives rise to the following theorem.




THEOREM 1. The posterior expected utility is maximised asymptotically by selecting the model which minimises the criterion function

$$\Delta\{m(\hat\theta_T)\} = \tfrac{1}{2}T\,l_T(\hat\theta_T) + \tfrac{1}{2}d\ln T + \eta_T\{m(\hat\theta_T)\}.$$


It is, perhaps, worth pointing out that the first two terms of $\Delta$ coincide with
the BIC criterion function associated with Rissanen (1978) and Schwarz (1978). 
These terms may be thought of as determining the posterior probability of a 
model, Poskitt and Tremayne (1983), and the final model selection is based upon 
a trade-off between the estimated posterior odds of the models and their relative 
utilities as represented by the last term. 

Consider now the ARMA($p$, $q$) model of (1.1). In order to satisfy Assumptions M the structural parameters $\alpha_1,\ldots,\alpha_p$ and $\mu_1,\ldots,\mu_q$ are assumed to belong to the subset of $R^{p+q}$ defined by the requirements that the roots of $\alpha(z)$ and $\mu(z)$ lie outside the unit circle, $\alpha(z)$ and $\mu(z)$ have no common factors and $\alpha_p$ and $\mu_q$ are not both zero. For this model

$$g(\theta,\omega) = \frac{\sigma^2}{2\pi}\,|K(e^{i\omega})|^2,$$

where $K(z) = \sum_{j=0}^{\infty}\kappa_j z^j = \mu(z)/\alpha(z)$, and the scale parameter $\sigma^2$ is assumed to lie in the interval $(\delta, 1/\delta)$ for arbitrarily small $\delta > 0$. Substituting in $l_T(\theta)$, $\theta = (\sigma^2, \alpha_1,\ldots,\alpha_p, \mu_1,\ldots,\mu_q)$, and concentrating with respect to $\sigma^2$ it is easily shown that

$$l_T(\theta) \ge \ln s^2 + N^{-1}\sum_j \ln|K(e^{i\omega_j})|^2 + 1,$$

where

$$s^2 = \frac{2\pi}{N}\sum_j \frac{I_T(\omega_j)}{|K(e^{i\omega_j})|^2}.$$


Of course, K and s? are functions of 6 and although for convenience this is not 
shown explicitly in the notation the evaluation of these quantities at the point 
corresponding to the Gaussian estimator will be indicated by the use of a 
circumflex. The following result now follows from the above theorem after some 
straightforward manipulations. 


COROLLARY. The Bayes decision rule is asymptotically equivalent to choos- 
ing the ARMA( p, q) model that minimises the criterion function 


A( p,q) = iris 82+ N'Y In| R(e7™) "| + 4(p + g)inT 
J 


ite) | 


a/N a ae 
+q7/ oy, 


J 


x} 




To provide a heuristic interpretation of this corollary assume that an ARMA($p_0$, $q_0$) model obtains so that, employing an obvious notation,

$$f(\omega) = g_0(\theta_0,\omega) = \frac{\sigma_0^2}{2\pi}\,|K_0(e^{i\omega})|^2 \quad \text{a.e.}$$

and the pseudo true parameter value $\theta_0$ now corresponds to the true parameter point in the usual sense. It is then well known that $\{\varepsilon(t)\}$ is the innovation process associated with Wold's decomposition theorem and that

$$\sigma_0^2 = \exp\left[(2\pi)^{-1}\int_{-\pi}^{\pi}\ln 2\pi f(\omega)\,d\omega\right],$$

the variance of the minimum mean squared error one-step ahead prediction error. This implies that

$$(2\pi)^{-1}\int_{-\pi}^{\pi}\ln|K_0(e^{i\omega})|^2\,d\omega = 0.$$


Furthermore, (1.1) defines for any $p$ and $q$ a residual process which may not be white noise unless the model is correct, that is $p = p_0$ and $q = q_0$, and the true parameter point $\theta_0$ is employed. Estimating the $u$th autocovariance of the residual process in the frequency domain by

$$\hat C_T(u) = \frac{2\pi}{N}\sum_j \frac{I_T(\omega_j)\,e^{i\omega_j u}}{|\hat K(e^{i\omega_j})|^2},$$

it follows from Parseval's theorem that the last term of $\Delta(p, q)$ may be written as

$$2\pi\sum_{u=0}^{T-1}\{\hat r_T(u)\}^2,$$

where $\hat r_T(u) = \hat C_T(u)/\hat C_T(0) = \hat C_T(u)/\hat s^2$ is the $u$th residual autocorrelation. The components of $\Delta(p, q)$ may therefore be regarded as assessing the extent to which the theoretical relationships of the true data generating mechanism are satisfied by the best fitting member in the family, or model, under consideration.
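The frequency-domain residual autocovariance and the Parseval step can be checked numerically. The sketch below is illustrative only: it assumes the periodogram scaling $I_T(\omega) = |\sum_t x_t e^{-i\omega t}|^2/(2\pi T)$, uses simulated data, and takes a hypothetical transfer function `K` with a fixed coefficient rather than a fitted Gaussian estimate. It computes $C(u) = (2\pi/N)\sum_j I_T(\omega_j)e^{i\omega_j u}/|K(e^{i\omega_j})|^2$, the autocorrelations $r(u) = C(u)/C(0)$, and compares $\sum_{|u| \le T-1} r(u)^2$ with its frequency-domain counterpart.

```python
import cmath
import math
import random


def residual_autocorrelations(x, K):
    """Return (r, s2, a): residual autocorrelations r[u], scale s2 = C(0), and
    the weights a_j = 2*pi*I_T(w_j) / |K(e^{i w_j})|^2 at w_j = 2*pi*j/N."""
    T = len(x)
    N = 2 * T - 1
    freqs = [2 * math.pi * j / N for j in range(-(T - 1), T)]
    a = []
    for w in freqs:
        dft = sum(xt * cmath.exp(-1j * w * t) for t, xt in enumerate(x))
        # 2*pi*I_T(w) = |dft|^2 / T under the assumed periodogram scaling
        a.append(abs(dft) ** 2 / T / abs(K(cmath.exp(1j * w))) ** 2)

    def C(u):
        # C(u) = (2*pi/N) * sum_j I_T(w_j) e^{i w_j u} / |K(e^{i w_j})|^2
        return sum(aj * cmath.exp(1j * w * u) for aj, w in zip(a, freqs)).real / N

    s2 = C(0)
    r = {u: C(u) / s2 for u in range(-(T - 1), T)}
    return r, s2, a


random.seed(1)
x = [0.0]
for _ in range(31):                        # simulate x_t = 0.5 x_{t-1} + e_t
    x.append(0.5 * x[-1] + random.gauss(0.0, 1.0))

K = lambda z: 1.0 / (1.0 - 0.5 * z)        # hypothetical transfer function
r, s2, a = residual_autocorrelations(x, K)
N = 2 * len(x) - 1
lhs = sum(v * v for v in r.values())       # sum over u = -(T-1), ..., T-1
rhs = sum((aj / s2) ** 2 for aj in a) / N  # frequency-domain counterpart
```

With `K` identically one, `C(u)` reduces to the usual sample autocovariance $T^{-1}\sum_t x_t x_{t+u}$, which gives a quick sanity check on the scaling.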


REMARK. The last term of $\Delta(p, q)$ is equivalent to the portmanteau statistic of Box and Pierce (1970) except that multiplication by $T$ in the usual statistic is replaced by multiplication by the constant $2\pi$ and the range of summation is extended to include autocorrelations at high lags. The statistical properties of this statistic and some evidence on its performance relative to portmanteau tests are presented in Milhøj (1981).


To summarise, in the context of ARMA models, the criterion function leads to 
the selection of the model that appears to provide the best compromise between 
predictive characteristics, the autocorrelation structure of the residuals and 
model dimensionality. 




4. Consistency of the delta criterion. The analysis of the previous section indicates that, given a set or range of models $\{M_i, i = 1,\ldots,m\}$, the adoption of Bayes rule leads to the selection of the $k$th model $M_k$ if

$$\Delta\{m_k(\hat\theta_{kT})\} = \min_{1 \le i \le m}\Delta\{m_i(\hat\theta_{iT})\}.$$

Should the minimum not be unique the practitioner is thought to be indifferent between those models $M_j$, $j = 1,\ldots,l \le m$, that yield the minimising value.
Taking this decision rule as given its performance can be analysed by examining the sampling behaviour of the expected utility ratio

$$R_T(i, j) = \exp[-\Delta\{m_i(\hat\theta_{iT})\}]\,/\,\exp[-\Delta\{m_j(\hat\theta_{jT})\}],$$

$i, j = 1,\ldots,m$, $i \ne j$, as $T \to \infty$.

Following the previous discussion a model is said to be true, or to obtain, if the pseudo true parameter $\theta_0$ associated with the model is such that $g(\theta_0,\omega) = f(\omega)$ a.e. and a range of models is said to encompass the data generating mechanism if there exists a specification $M_i \in \{M_i, i = 1,\ldots,m\}$ that is true. By virtue of transitivity, the behaviour of the decision rule when comparing a range of models is characterised by the following theorem, in which the notation $a_T \sim b_T$ is used to mean that $(a_T/b_T) \to 1$ as $T \to \infty$.


THEOREM 2. Let $R_T(2,1)$ denote the expected utility ratio between two models $M_i = \{g_i(\theta_i,\omega) \in L^2, \theta_i \in \Theta_i\}$, $i = 1, 2$, with pseudo true parameter values $\theta_{10}$ and $\theta_{20}$, respectively. If

(i) $M_1$ and $M_2$ do not encompass the data generating mechanism and


fo) | | fe) 
8,(819, w) Elbo w) 
for $\delta < 1$, uniformly in $\omega$, or

(ii) $M_1$ is true and $g_2(\theta_{20},\omega) \ne f(\omega)$ on a set of nonzero $\nu$ measure with relative error bounded above as in (i),

then there exists a constant $C > 0$ and a decreasing sequence $C_T \searrow C$ such that

$$R_T(2,1) \sim \exp[-C_T T] \sim \{\exp[-C]\}^T \quad \text{a.s.}$$


0< TE 











If 
(iii) both M, and M, obtain, 
then 


aD a Aa a 
To prove Theorem 2 observe from Lemmas 1, 2, 3, and 6 that

$$2\Delta\{m(\hat\theta_T)\} = T\,l(\theta_0) + d\ln T + 2\eta\{m(\theta_0)\} + o(T) \quad \text{a.s.}$$
Assuming that condition (i) or (ii) holds, substituting the expansion 
_ fe) | fo) -1| E | flo) P 
g(8,0)  \g(0, w) g(8, w) 













in the definition of the corroborant function and simplifying the resulting 
expression for the difference 2[ A{m,(6,-)} — A{m,( 6,-)} gives the equation 


2In R,(2,1) 2\ 1 æf flo) ? f(w) Í 
T 7 £ + Feel Poses g J 7 Row E r a 


+0(8°) + o(1) as. 


for the logarithm of the expected utility ratio. From the properties of the regrets $\eta\{m_1(\theta_{10})\}$ and $\eta\{m_2(\theta_{20})\}$ implied by statements (i) and (ii) the first term above is negative definite and, since $\delta$ is arbitrary, there therefore exists a constant $C > 0$ and a decreasing sequence $C_T \searrow C$ such that

$$|T^{-1}\ln R_T(2,1) + C_T| \to 0 \quad \text{a.s.}$$

as $T \to \infty$, giving the required result. Similar arguments applied under condition (iii) show that

$$|T^{-1}(2\ln R_T(2,1) - (d_1 - d_2)\ln T)| \to 0 \quad \text{a.s.},$$

which is equivalent to the statement of the theorem as $\lim T^{1/T}$ is unity.

The structure of Theorem 2 shows that if $\{X(t)\}$ emanates from an unknown specification within $\{M_i, i = 1,\ldots,m\}$ then, as the sample size increases, $\Delta$ will select the true model with probability one. Furthermore, since for any $a, b > 0$, $T^a = o\{\exp(bT)\}$, the theorem also lends support to the intuitively appealing
notion that it is possible to distinguish a model that is true from one that is not 
more readily than it is to reconcile alternative parametric families both of which 
obtain. In the latter situation the decision rule will, for large finite T, resolve the 
dilemma posed by having to select one model by choosing the most parsimoni- 
ous, although any preference may be blurred. Taking dimensionality d as an 
index of model complexity this amounts to an implementation of the principle of 
simplicity, Rosenkrantz (1983, Chapter 5). See also Rissanen (1983) and, for some 
discussion of the philosophical point that consistency is to be equated with 
choosing the most parsimonious true model, Atkinson (1980). 

The existence of a definitive true model of finite dimensionality can of course 
be called into question. Shibata (1980), for example, who is concerned with mean 
squared prediction error and whose results have been generalised by Taniguchi 
(1980), explicitly assumes infinite true order, and Stone (1979) has stressed the 
need to consider more complex, profligate parameterisations as $T \to \infty$. Recognition of this motivates consideration of the behaviour of $R_T(2,1)$ when the model set does not encompass the data generating mechanism as in Theorem 2(i). When $\delta$ is small both $M_1$ and $M_2$ provide reasonable guides to the data generating mechanism but the first model gives a more accurate approximation to the distribution of power induced by $f(\omega)$ and, hence, to the actual structure of the process. Asymptotically, the magnitude of $R_T(2,1)$ will reflect the relative proximity of the two models and will lead to the identification of the best fitting parametric family $M_1$.

The implication of the foregoing discussion is that model posterior expected 
utilities provide a potentially useful basis for making comparisons between 




alternative parametric specifications. If, when considering a range of models, the numerical value of the criterion $\Delta$ for one model is small in relation to that of another then this will indicate that the model is more appropriate for the process in hand and is to be preferred. There may be circumstances, however, where a mechanistic application of the decision rule to select a single preferred model will be undesirable as little consideration will thereby be given to the relative merits of other specifications. If the expected utility ratio is close to one for two or more models then any preference between them may be indistinct. This raises the possibility of the practitioner simultaneously entertaining a few parametric families between which she/he is essentially indifferent or employing an average model, as suggested by Akaike (1978) for example, in order to better understand the process. The question of the possible efficacy of using the criterion $\Delta$ to identify univariate time series models can only really be answered by reference to experience and experimentation, however.


5. Proofs. In the sequel $\|A\|$ will be used to denote the operator norm $\sup_{\|z\|=1}\|Az\|$ of a matrix $A$ where, for any vector $z$, $\|z\|$ is the Euclidean norm. The letter $C$ denotes a universal constant. Consider first the following preliminary result.


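LEMMA 6. Let

$$\psi_{1,T}(\theta) = (2\pi/N)\sum_j I_T(\omega_j)\,h(\theta,\omega_j)$$

and

$$\psi_{2,T}(\theta) = (2\pi/N)\sum_j \{I_T(\omega_j)\,h(\theta,\omega_j)\}^2,$$

where $h(\theta,\omega)$ is a continuous real valued function on $\Theta \times [-\pi,\pi]$ with continuous partial derivatives $\partial h(\theta,\omega)/\partial\theta_i$, $i = 1,\ldots,d$. Then

$$\psi_{1,T}(\theta) \to \int_{-\pi}^{\pi} f(\omega)\,h(\theta,\omega)\,d\omega \quad \text{a.s.}$$

and

$$\psi_{2,T}(\theta) \to 2\int_{-\pi}^{\pi}\{f(\omega)\,h(\theta,\omega)\}^2\,d\omega \quad \text{a.s.}$$

uniformly for all $\theta$ in $\Theta$.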
PROOF. The limiting behaviour of smoothed periodogram values has been discussed elsewhere in the literature under weaker regularity conditions than those assumed at present. See Anderson (1971, Chapters 8 and 9) and Brillinger (1975, Chapters 4 and 5). The methods employed by these authors can be applied here to show that for any $\theta \in \Theta$

$$\psi_{1,T}(\theta) \to \int_{-\pi}^{\pi} f(\omega)\,h(\theta,\omega)\,d\omega \quad \text{a.s.}$$




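as $T \to \infty$. To prove that the convergence is uniform, note that for any $\delta > 0$ and $\theta_1, \theta_2 \in \Theta$ such that $\|\theta_1 - \theta_2\| < \delta$

$$h(\theta_1,\omega) - h(\theta_2,\omega) = (\theta_1 - \theta_2)'\,\partial h(\bar\theta,\omega)/\partial\theta,$$

where $\bar\theta = \theta_2 + \lambda(\theta_1 - \theta_2)$, $0 < \lambda < 1$, by the mean value theorem. Hence

$$|h(\theta_1,\omega) - h(\theta_2,\omega)| < \delta\sum_{i=1}^{d}\sup|\partial h(\theta,\omega)/\partial\theta_i|,$$

where the supremum is taken over $[-\pi,\pi]$ and $\theta \in N_\delta(\theta_1) = \{\theta: \|\theta_1 - \theta\| < \delta\}$, $\delta$ being chosen so that $N_\delta(\theta_1) \subset \Theta$. Consequently there exists a constant $C$ such that

$$|\psi_{1,T}(\theta_1) - \psi_{1,T}(\theta_2)| < \delta\cdot C\cdot(2\pi/N)\sum_{j=-T+1}^{T-1} I_T(\omega_j)$$

and $(2\pi/N)\sum_j I_T(\omega_j)$ converges to $\int f(\omega)\,d\omega$ a.s. Thus, $\psi_{1,T}(\theta)$ is equicontinuous and converges uniformly to $\int f(\omega)\,h(\theta,\omega)\,d\omega$.

The proof that $\psi_{2,T}(\theta)$ converges as indicated proceeds along almost identical lines, cf. Milhøj (1981, Lemma 1).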


PROOF OF LEMMA 1. This is obtained at once from Lemma 6 on setting $h(\theta,\omega) = 1/g(\theta,\omega)$.


PROOF OF LEMMA 2. Part (i) of the lemma follows almost immediately from a theorem due to Grenander and Szegö (1958, Chapter 5) that concludes that

$$\lim_{T\to\infty} T^{-1}\ln\det\Sigma_T(\theta) = (2\pi)^{-1}\int_{-\pi}^{\pi}\ln 2\pi g(\theta,\omega)\,d\omega;$$


it is only necessary to observe that 


T-1 oC 7 
N E Ina(8,o,)= X p, kN) > (27) f Ing(8,0) de, 


j=- T+) k=- 


as the coefficients 
o(8, n) = f In g(0, we" dw 


decline at a geometric rate, to show convergence for a given $\theta$. The fact that the limit is uniform in $\theta$ is not stated explicitly in Grenander and Szegö although it follows directly from the uniformity of the order relations used in their proof. The proof depends upon the approximation of $g(\theta,\omega)$ by trigonometric polynomials. However, $g(\theta,\omega)$ is a continuous function of $\theta$ and $\omega$ on $\Theta \times [-\pi,\pi]$ and is, by assumption, bounded away from zero. Hence $g(\theta,\omega)$ may be approximated by a polynomial uniformly in $\theta$ and $\omega$ on $\Theta \times [-\pi,\pi]$ and this completes the proof of (i).

Part (ii) may be derived similarly from the preliminary lemma, see also Hannan (1973, Lemma 4).




PROOF OF LEMMA 3. By a direct application of Lemma 6 $l_T(\theta)$ converges uniformly to $l(\theta)$ a.s. Let $\theta^*$ be a limit point of the sequence $\hat\theta_T$. By uniform convergence and continuity, for any $\delta > 0$, $|l_T(\theta^*) - l(\theta^*)| < \delta/2$ and $|l_T(\hat\theta_T) - l_T(\theta^*)| < \delta/2$ a.s. for $T > T_\delta$ and hence $l_T(\hat\theta_T) \to l(\theta^*)$ a.s. By definition, $l(\theta_0) \le l(\theta^*)$ and $l_T(\hat\theta_T) \le l_T(\theta_0)$ for all $T$, which implies that $l(\theta^*) = l(\theta_0)$. Since $l$ has a unique minimum at $\theta_0$ it follows that $\theta^* = \theta_0$.


PROOF OF LEMMA 4. The proof of this lemma involves arguments similar to those already employed in proving Lemmas 1 and 2. The details are omitted.


PROOF OF LEMMA 5. Consider 


6) dð = 6) dð + 6) d8 
f (8) Jpeg, O78) l a 07) 


where, for e > 0 arbitrary, E,6,) is the elliptical neighbourhood {0: (0 — 
f ) H (ê, X0 — 6,) < £}. By Lemma 3 there exists a T, such that for arbitrary 
6,, 6, lies in the spherical neighbourhood N,(®%) for all T> T, and for 
0 E © N N,(8)} there exists a positive constant C such that 1,(8) — 1,-(8) > C. 
Set 6, = ¢/(2\|H,(6,)|). Then N,(8) © Nz (fr) € E(6;) which implies that 
© N Ebr) c © \ N, (%) and hence 


1/2 = 
1,.= { supdet 1(8)} T?exp| —47C|»,(0) > 0 as. 


as T > œ. 
Since I(ĝ) is a uniformly continuous function of 8, for any ¢ > 0 there exists a 
ô, > 0 such that 


det 1(6,)(1 — £) < det I(@) < det 1(6,.)(1 + £) 
for all 8 E N; ($r) and by Taylor’s theorem 
Qr(8)(1 — $) < Ip(8) — lr(fr) < Qr(0) + $), 
where 
Qr(8) ai 3(8 a r) H,(6,)(8 E êz). 


If e = 6,/\JH7(6,)~'\| then Er) C N, (87). Substituting in ¢r(80) and employ- 
ing Lemmas 3 and 4 in conjunction with the properties of the multivariate 
normal density function and the incomplete gamma ratio yields 


= (i$) {1 ~ yr(e(l + $)/2)}, 
where, using conventional notation, 


n-l 


yr( y) = exp(— Ty/2) }, (Ty/2)’/T( 7 + 1) 


J=0 




when d = 2n and 


Yr( y) = exp(—Ty/2) E (Ty/2)} "P /TU + 3/2) + 2{1 - &(VYTy)} 


J=0 


when d= 2n — 1, n<(d+ 1)/⁄2<n+ 1. As § is arbitrary it follows that 
I, > 1 as. as T > œ. The proof is completed using standard results from the 
theory of generalised functions, Zemanian (1965, Theorem 2.3.2) and Gel’fand 
and Shilov, (1964, pages 34-39). 


Acknowledgments. I would like to thank the referees for their comments. I 
am also grateful to an associate editor for helpful remarks and constructive 
suggestions on the content of the paper and for reference to the article by M. 
Taniguchi. These have led to corrections, alterations and, I believe, improve- 
ments in the presentation of the paper. 


REFERENCES 
AKAIKE, H. (1969). Fitting autoregressive models for prediction. Ann. Inst. Statist. Math. 21 243-247.

AKAIKE, H. (1978). A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math. 30 9-14.

ANDERSON, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.

ATKINSON, A. C. (1980). A note on the generalised information criterion for choice of a model. Biometrika 67 413-418.

BLOOMFIELD, P. (1973). An exponential model for the spectrum of a scalar time series. Biometrika 60 217-226.

BOX, G. E. P. and PIERCE, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65 1509-1526.

BRILLINGER, D. R. (1975). Time Series: Data Analysis and Theory. Holt, Rinehart and Winston, New York.

GEL'FAND, I. M. and SHILOV, G. E. (1964). Generalized Functions 1. Academic, New York.

GRENANDER, U. and SZEGÖ, G. (1958). Toeplitz Forms and Their Applications. Univ. California Press.

HANNAN, E. J. (1973). The asymptotic theory of linear time series models. J. Appl. Probab. 10 130-145.

HANNAN, E. J. (1980). The estimation of the order of an ARMA process. Ann. Statist. 8 1071-1081.

HANNAN, E. J. and QUINN, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41 190-195.

HANNAN, E. J. and RISSANEN, J. (1982). Recursive estimation of mixed autoregressive-moving average order. Biometrika 69 81-94.

HOSKING, J. R. M. (1980). Lagrange-multiplier tests of time series models. J. Roy. Statist. Soc. Ser. B 42 170-181.

JEFFREYS, H. (1961). Theory of Probability, 3rd ed. Clarendon, Oxford.

LINDLEY, D. V. (1960). The use of prior probability distributions in statistical inference and decisions. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1 453-468. Univ. California Press.

MILHØJ, A. (1981). A test of fit in time series models. Biometrika 68 177-187.

PARZEN, E. (1974). Some recent advances in time series modelling. IEEE Trans. Automat. Control AC-19 723-730.

PERICCHI, L. R. (1984). An alternative to the standard Bayesian procedure for discrimination between normal linear models. Biometrika 71 575-586.

POSKITT, D. S. and TREMAYNE, A. R. (1981). An approach to testing linear time series models. Ann. Statist. 9 974-986.

POSKITT, D. S. and TREMAYNE, A. R. (1983). On the posterior odds of time series models. Biometrika 70 157-162.

PÖTSCHER, B. M. (1983). Order estimation in ARMA-models by Lagrangian multiplier tests. Ann. Statist. 11 872-885.

RAIFFA, H. and SCHLAIFFER, R. (1961). Applied Statistical Decision Theory. M.I.T. Press, Cambridge, Mass.

RISSANEN, J. (1978). Modelling by shortest data description. Automatica-J. IFAC 14 467-71.

RISSANEN, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11 416-431.

ROSENKRANTZ, R. D. (1983). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Reidel, Boston.

SAVAGE, L. J., ET AL. (1962). The Foundations of Statistical Inference. Methuen, London.

SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464.

SHIBATA, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8 147-164.

STONE, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. Roy. Statist. Soc. Ser. B 41 276-278.

TANIGUCHI, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Statist. Math. 32 401-419.

WHITTLE, P. (1962). Gaussian estimation in stationary time series. Bull. Inst. Internat. Statist. 39 105-129.

ZEMANIAN, A. H. (1965). Distribution Theory and Transform Analysis: An Introduction to Generalized Functions, with Applications. McGraw-Hill, New York.


DEPARTMENT OF STATISTICS, I.A.S., 
AUSTRALIAN NATIONAL UNIVERSITY 
CANBERRA, ACT 2601 

AUSTRALIA 


The Annals of Statistics
1986, Vol. 14, No. 2, 517-532


LARGE-SAMPLE PROPERTIES OF PARAMETER ESTIMATES
FOR STRONGLY DEPENDENT STATIONARY GAUSSIAN TIME
SERIES


By Robert Fox¹ and Murad S. Taqqu²,³


Cornell University


A strongly dependent Gaussian sequence has a spectral density $f(x,\theta)$ satisfying $f(x,\theta) \sim |x|^{-\alpha(\theta)}L_\theta(x)$ as $x \to 0$, where $0 < \alpha(\theta) < 1$ and $L_\theta(x)$ varies slowly at 0. Here $\theta$ is a vector of unknown parameters. An estimator for $\theta$ is proposed and shown to be consistent and asymptotically normal under appropriate conditions. These conditions are satisfied by fractional Gaussian noise and fractional ARMA, two examples of strongly dependent sequences.


1. Introduction. Let $X_j$, $j \ge 1$, be a stationary Gaussian sequence with mean $\mu$ and spectral density $\sigma^2 f(x,\theta)$, $-\pi \le x \le \pi$, where $\mu$, $\sigma^2 > 0$ and the vector $\theta \in E \subset R^p$ are unknown parameters. Denote the covariance by $\sigma^2 r_k(\theta)$, so that

$$E(X_j - \mu)(X_{j+k} - \mu) = \sigma^2 r_k(\theta) = \sigma^2\int_{-\pi}^{\pi} e^{ikx}f(x,\theta)\,dx.$$

(We are not assuming that $\sigma^2$ is the variance of $X_j$.) Let $R_N(\theta)$ be the $N \times N$ matrix with $j,k$th entry $r_{j-k}(\theta)$. Thus $\sigma^2 R_N(\theta)$ is the covariance matrix of $X_1,\ldots,X_N$. Our object is to estimate $\theta$ and $\sigma^2$ based on the observations $X_1,\ldots,X_N$.

We are interested in strongly dependent sequences $X_j$, that is, in sequences with $f(x,\theta) \sim |x|^{-\alpha(\theta)}L_\theta(x)$ as $x \to 0$, where $0 < \alpha(\theta) < 1$ and $L_\theta(x)$ varies slowly at 0. These sequences have covariances that decrease too slowly to permit the normalized partial sums

$$\frac{1}{\sqrt{N}}\sum_{j=1}^{[Nt]} X_j, \qquad t \ge 0,$$

to converge weakly to Brownian motion. Because of this fact, strongly dependent sequences play an important role in the theory of self-similar stochastic processes. Two examples, fractional Gaussian noise and fractional ARMA's, are described at the end of this section. We will estimate simultaneously all the unknown


Received April 1984; revised June 1986.

¹Presently at Rochester Institute of Technology, Department of Mathematics, Rochester, NY 14623.

²Presently at Boston University, Department of Mathematics, Boston, MA 02216.

³Research supported by the National Science Foundation grant ECS-80-15585 at Cornell University.

AMS 1980 subject classifications. Primary 62F12; secondary 60F99, 62M10.

Key words and phrases. Strong dependence, long-range dependence, maximum likelihood estimation, fractional Gaussian noise, fractional ARMA.




518 R. FOX AND M S. TAQQU 


parameters symbolized by the vector 0, and not just the exponent a(@) in 
isolation. 

A number of approaches to parameter estimation for strongly dependent 
sequences have been considered in the literature. These include the R/S tech- 
nique, periodogram estimation, and maximum likelihood estimation. Theoretical 
properties of the R/S estimates have been investigated by Mandelbrot (1975) 
and Mandelbrot and Taqqu (1979). Periodogram estimation has been considered 
by Mohr (1981), Graf (1983), and Geweke and Porter-Hudak (1983). Hipel and 
McLeod (1978) have discussed computational considerations involved in the 
application of maximum likelihood estimation. See also Todini and O’Connell 
(1979). 

The study of maximum likelihood estimation for strongly dependent sequences 
is a special case of the problem of maximum likelihood estimation for dependent 
observations. Sweeting (1980) has given conditions under which the maximum 
likelihood estimator is consistent and asymptotically normally distributed. 
Basawa and Prakasa Rao (1980) and Basawa and Scott (1983) survey theorems 
and examples in this area. In order to apply these results it would be necessary to study the second derivatives of $R_N^{-1}(\theta)$. To avoid this difficulty we will follow the approach suggested by Whittle (1951). This involves maximizing

$$\frac{1}{\sigma}\exp\left\{-\frac{Z'A_N(\theta)Z}{2N\sigma^2}\right\}. \tag{1.1}$$

Here $Z = (X_1 - \bar X_N,\ldots,X_N - \bar X_N)'$, $\bar X_N = (1/N)\sum_{j=1}^{N}X_j$, and $A_N(\theta)$ is the $N \times N$ matrix with entries $[A_N(\theta)]_{jk} = a_{j-k}(\theta)$, where

$$a_j(\theta) = \frac{1}{(2\pi)^2}\int_{-\pi}^{\pi} e^{ijx}[f(x,\theta)]^{-1}\,dx. \tag{1.2}$$

For the approximation of the inverse of $(R(\theta))_{j\ge 0,k\ge 0}$ by $(A(\theta))_{j\ge 0,k\ge 0}$, see Bleher (1981) and also Beran and Kuensch (1985). Notice that, by Parseval's relation, the doubly infinite matrix $A(\theta)$ with entries $a_{j-k}(\theta)$, $-\infty < j, k < \infty$, is the inverse of the doubly infinite matrix $R(\theta)$ with entries $r_{j-k}(\theta)$, $-\infty < j, k < \infty$.

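Thus we define estimators $\hat\theta_N$ and $\hat\sigma_N^2$ to be those values of $\theta$ and $\sigma^2$ which maximize $(1/\sigma)\exp\{-(1/2N\sigma^2)Z'A_N(\theta)Z\}$. This is equivalent to choosing $\hat\theta_N$ to minimize

$$\sigma_N^2(\theta) = \frac{Z'A_N(\theta)Z}{N} \tag{1.3}$$

and then setting $\hat\sigma_N^2 = \sigma_N^2(\hat\theta_N)$. It will be convenient to use the fact that

$$\sigma_N^2(\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}[f(x,\theta)]^{-1}I_N(x)\,dx, \tag{1.4}$$

where

$$I_N(x) = \frac{\left|\sum_{j=1}^{N} e^{ijx}(X_j - \bar X_N)\right|^2}{2\pi N}. \tag{1.5}$$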
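A minimal numerical sketch of the minimization just described follows. It is not the authors' procedure: it replaces the integral of $[f(x,\theta)]^{-1}I_N(x)$ by a Riemann sum over the frequencies $x_j = 2\pi j/N$, restricts attention to the hypothetical one-parameter family $f(x,d) = |e^{ix} - 1|^{-2d}$ (which integrates in the logarithm to zero over $[-\pi,\pi]$), and searches a grid of candidate values of $d$. The function names, grid and simulated data are all illustrative assumptions.

```python
import cmath
import math
import random


def periodogram_values(x):
    """Pairs (x_j, I_N(x_j)) with I_N(x_j) = |sum_t e^{i t x_j}(X_t - mean)|^2 / (2*pi*N)
    at the frequencies x_j = 2*pi*j/N, j = 0, ..., N-1."""
    N = len(x)
    mean = sum(x) / N
    z = [v - mean for v in x]
    out = []
    for j in range(N):
        w = 2 * math.pi * j / N
        dft = sum(zt * cmath.exp(1j * w * t) for t, zt in enumerate(z, start=1))
        out.append((w, abs(dft) ** 2 / (2 * math.pi * N)))
    return out


def sigma2_N(pgram, d):
    """Riemann-sum version of (1/(2*pi)) * integral of I_N(x) / f(x, d) dx,
    with the assumed density f(x, d) = |e^{ix} - 1|^{-2d}."""
    N = len(pgram)
    total = sum(abs(cmath.exp(1j * w) - 1) ** (2 * d) * I for w, I in pgram)
    return total / N          # (1/(2*pi)) * (2*pi/N) * sum


def whittle_d(x, grid):
    """Grid value of d minimising the discretised sigma2_N."""
    pgram = periodogram_values(x)
    return min(grid, key=lambda d: sigma2_N(pgram, d))


random.seed(2)
x = [random.gauss(0.0, 1.0) for _ in range(64)]   # illustrative data only
grid = [k / 100 for k in range(0, 50)]            # candidate d in [0, 0.49]
d_hat = whittle_d(x, grid)
```

At $d = 0$ the density is identically one and the discretised criterion reduces to $\sum_j (X_j - \bar X_N)^2/(2\pi N)$, which is a convenient check on the scaling. A finer grid or a one-dimensional optimiser could replace the grid search.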


ESTIMATION OF STRONGLY DEPENDENT SERIES 519 


Walker (1964) showed that $\hat\theta_N$ is consistent and asymptotically normal in many cases in which the sequence $\{X_j\}$ is weakly dependent (and not necessarily Gaussian). Hannan (1973) improved these results and was able to use them to prove consistency and asymptotic normality of the maximum likelihood estimator. Dunsmuir and Hannan (1976) gave extensions to the case of vector-valued observations.

We show that for strongly dependent Gaussian sequences $\{X_j\}$ the estimator $\hat\theta_N$ is consistent and asymptotically normal. The conditions under which this result holds are given in Section 2. Our results apply to fractional Gaussian noise and fractional ARMA's.

Fractional Gaussian noise was introduced by Mandelbrot and Van Ness (1968) and has been widely used to model strongly dependent geophysical phenomena. It is a stationary Gaussian sequence with mean 0 and covariance

$$r_k = EX_jX_{j+k} = \frac{C}{2}\left\{|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H}\right\},$$

where $H$ is a parameter satisfying $\tfrac{1}{2} < H < 1$ and $C > 0$. This covariance satisfies

$$r_k \sim CH(2H-1)k^{2H-2} \quad \text{as } k \to \infty.$$
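The covariance and its asymptotic form can be compared directly. The short sketch below (function names are illustrative, not from the paper) evaluates both expressions; their ratio should approach one as $k$ grows, and for $H = 1/2$ the covariance vanishes at all nonzero lags.

```python
import math


def fgn_cov(k, H, C=1.0):
    """Covariance (C/2) * (|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})."""
    return 0.5 * C * (abs(k + 1) ** (2 * H) - 2 * abs(k) ** (2 * H) + abs(k - 1) ** (2 * H))


def fgn_cov_asymptotic(k, H, C=1.0):
    """Leading term C * H * (2H - 1) * k^{2H - 2} for large k > 0."""
    return C * H * (2 * H - 1) * k ** (2 * H - 2)


# Ratio of exact to asymptotic covariance at a large lag
ratio = fgn_cov(1000, 0.8) / fgn_cov_asymptotic(1000, 0.8)
```

The slow power-law decay of `fgn_cov` for $H > 1/2$ is what makes the covariances non-summable, which is the strong dependence the text refers to.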
The spectral density $f(x, H)$ of fractional Gaussian noise is given by

$$f(x, H) = CF(H)f_0(x, H), \tag{1.6}$$

where

$$f_0(x, H) = (1 - \cos x)\sum_{k=-\infty}^{\infty}|x + 2k\pi|^{-1-2H}, \qquad -\pi \le x \le \pi, \tag{1.7}$$

and

$$F(H) = \left\{\int_{-\infty}^{\infty}(1 - \cos x)|x|^{-1-2H}\,dx\right\}^{-1} \tag{1.8}$$

[see Sinai (1976)]. As $x \to 0$ we have

$$f(x, H) \sim \frac{CF(H)}{2}|x|^{1-2H}.$$

Fractional Gaussian noise is the unique Gaussian sequence with the property that $S_{mN} = \sum_{j=1}^{mN}X_j$ has the same distribution as $m^H S_N$ for all $m, N \ge 1$. Further properties are discussed in Mandelbrot and Taqqu (1979).

Another example to which our results apply is fractional ARMA. To define it, let $g(x,\xi) = \sum_{j=0}^{p}\xi_j x^j$ and $h(x,\phi) = \sum_{j=0}^{q}\phi_j x^j$, where $\xi = (\xi_0,\ldots,\xi_p)$ and $\phi = (\phi_0,\ldots,\phi_q)$. Suppose that $g(x,\xi)$ and $h(x,\phi)$ have no zeros on the unit circle and no zeros in common. For $0 < d < \tfrac{1}{2}$, define the spectral density

$$f(x, d, \xi, \phi) = C|e^{ix} - 1|^{-2d}\left|\frac{g(e^{ix},\xi)}{h(e^{ix},\phi)}\right|^2, \qquad -\pi \le x \le \pi. \tag{1.9}$$

A Gaussian sequence with mean 0 and spectral density $f(x, d, \xi, \phi)$ is called a fractional ARMA process. Heuristically, it is the sequence which, when differenced $d$ times, yields an ARMA process with spectral density

$$C\left|\frac{g(e^{ix},\xi)}{h(e^{ix},\phi)}\right|^2.$$

Granger and Joyeux (1980) and Hosking (1981) have proposed the use of fractional ARMA's to model strongly dependent phenomena, since

$$f(x, d, \xi, \phi) \sim C\left|\frac{g(1,\xi)}{h(1,\phi)}\right|^2|x|^{-2d} \quad \text{as } x \to 0.$$


2. Statements of the theorems. Let $X_j$, $j \ge 1$, be a stationary Gaussian sequence with mean $\mu$ and spectral density $\sigma^2 f(x,\theta)$, where $\mu$, $\sigma^2 > 0$ and $\theta \in E$ are unknown parameters. The set $E \subset R^p$ is assumed to be compact. Let $\sigma_0^2$ and $\theta_0$ be the true values of the parameters. We assume that $\theta_0$ is in the interior of $E$.

If $\theta$ and $\theta'$ are distinct elements of $E$, we suppose that the set $\{x: f(x,\theta) \ne f(x,\theta')\}$ has positive Lebesgue measure, so that different $\theta$'s correspond to different dependence structures. Assume the functions $f(x,\theta)$ are normalized so that

$$\int_{-\pi}^{\pi}\log f(x,\theta)\,dx = 0, \qquad \theta \in E.$$

Let $f^{-1}(x,\theta) = 1/f(x,\theta)$.


REMARK. The condition $\int_{-\pi}^{\pi} \log f(x, \theta)\,dx > -\infty$ guarantees that the sequence
$\{X_j\}$ admits a backward expansion

$$X_j = \sigma \sum_{k=0}^{\infty} b_k(\theta)\,\varepsilon_{j-k},$$

where $\varepsilon_j$, $j \ge 1$, are independent standard normal random variables. The first
coefficient $b_0(\theta)$ is the one-step prediction standard deviation of the sequence
$Y_j = X_j/\sigma$. It is given by

$$b_0(\theta) = 2\pi \exp\left\{\frac{1}{2\pi} \int_{-\pi}^{\pi} \log f(x, \theta)\,dx\right\}.$$

[See Hannan (1970), page 137.] If $\int_{-\pi}^{\pi} \log f(x, \theta)\,dx = 0$ for all $\theta$, it follows that
$b_0(\theta) = 2\pi$ and so $\{Y_j\}$ has one-step prediction standard deviation independent
of $\theta$.


We will refer to the following conditions. 


CONDITIONS A. We say that $f(x, \theta)$ satisfies conditions A.1-A.6 if there
exists $0 < \alpha(\theta) < 1$ such that for each $\delta > 0$


(A.1) $g(\theta) = \int_{-\pi}^{\pi} \log f(x, \theta)\,dx$ can be differentiated twice under the integral
sign.




ESTIMATION OF STRONGLY DEPENDENT SERIES 521 


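(A.2) $f(x, \theta)$ is continuous at all $(x, \theta)$, $x \ne 0$, $f^{-1}(x, \theta)$ is continuous at all
$(x, \theta)$, and

$$f(x, \theta) = O(|x|^{-\alpha(\theta)-\delta}) \qquad \text{as } x \to 0.$$

(A.3) $\partial/\partial\theta_j\, f^{-1}(x, \theta)$ and $\partial^2/\partial\theta_j\,\partial\theta_k\, f^{-1}(x, \theta)$ are continuous at all $(x, \theta)$,

$$\frac{\partial}{\partial\theta_j} f^{-1}(x, \theta) = O(|x|^{\alpha(\theta)-\delta}) \qquad \text{as } x \to 0, \quad 1 \le j \le p,$$

and

$$\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f^{-1}(x, \theta) = O(|x|^{\alpha(\theta)-\delta}) \qquad \text{as } x \to 0, \quad 1 \le j, k \le p.$$

(A.4) $\partial/\partial x\, f(x, \theta)$ is continuous at all $(x, \theta)$, $x \ne 0$, and

$$\frac{\partial}{\partial x} f(x, \theta) = O(|x|^{-\alpha(\theta)-1-\delta}) \qquad \text{as } x \to 0.$$

(A.5) $\partial^2/\partial x\,\partial\theta_j\, f^{-1}(x, \theta)$ is continuous at all $(x, \theta)$, $x \ne 0$, and

$$\frac{\partial^2}{\partial x\,\partial\theta_j} f^{-1}(x, \theta) = O(|x|^{\alpha(\theta)-1-\delta}) \qquad \text{as } x \to 0, \quad 1 \le j \le p.$$

(A.6) $\partial^3/\partial x^2\,\partial\theta_j\, f^{-1}(x, \theta)$ is continuous at all $(x, \theta)$, $x \ne 0$, and

$$\frac{\partial^3}{\partial x^2\,\partial\theta_j} f^{-1}(x, \theta) = O(|x|^{\alpha(\theta)-2-\delta}) \qquad \text{as } x \to 0, \quad 1 \le j \le p.$$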


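REMARK. The constants that appear in the $O(\cdot)$ conditions may depend on
the parameters $\theta$ and $\delta$. If $L(x)$ varies slowly as $x \to 0$, then $L(x) = O(|x|^{-\delta})$ as
$x \to 0$ for every $\delta > 0$. [See Feller (1971), page 277.] Thus (A.1)-(A.6) will be
satisfied if the indicated continuity holds, $f(x, \theta)$ varies regularly as $x \to 0$ with
exponent $-\alpha(\theta)$, $\partial/\partial x\, f(x, \theta)$ with exponent $-\alpha(\theta) - 1$, $\partial/\partial\theta_j\, f^{-1}(x, \theta)$ and
$\partial^2/\partial\theta_j\,\partial\theta_k\, f^{-1}(x, \theta)$ with exponent $\alpha(\theta)$, $\partial^2/\partial x\,\partial\theta_j\, f^{-1}(x, \theta)$ with exponent
$\alpha(\theta) - 1$, and $\partial^3/\partial x^2\,\partial\theta_j\, f^{-1}(x, \theta)$ with exponent $\alpha(\theta) - 2$.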


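DEFINITION OF THE ESTIMATOR. Consider the quadratic form $\sigma_N^2(\theta)$ given in
(1.3). Let $\hat\theta_N$ be a value of $\theta$ which minimizes $\sigma_N^2(\theta)$. Put $\hat\sigma_N^2 = \sigma_N^2(\hat\theta_N)$. The
following theorem establishes the strong consistency of the estimators $\hat\theta_N$ and $\hat\sigma_N^2$.


THEOREM 1. If $f(x, \theta)$ satisfies conditions (A.2) and (A.4), then with probability 1

$$\lim_{N \to \infty} \hat\theta_N = \theta_0 \qquad \text{and} \qquad \lim_{N \to \infty} \hat\sigma_N^2 = \sigma_0^2.$$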




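To state the next theorem, let $W(\theta)$ be the $p \times p$ matrix with $j, k$th entry

$$w_{jk}(\theta) = \int_{-\pi}^{\pi} f(x, \theta)\, \frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f^{-1}(x, \theta)\,dx.$$


THEOREM 2. If conditions (A.1)-(A.6) are satisfied then the random vector
$\sqrt{N}(\hat\theta_N - \theta_0)$ tends in distribution to a normal random vector with mean 0 and
covariance matrix $4\pi W^{-1}(\theta_0)$.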


Theorems 1 and 2 are proven in Section 3. 


REMARK 1. As in Hannan (1973), it can be shown that the results hold if
$\sigma_N^2(\theta)$ in (1.4) is replaced by

$$\frac{1}{N} \sum_{k} f^{-1}\!\left(\frac{2\pi k}{N}, \theta\right) I_N\!\left(\frac{2\pi k}{N}\right),$$

where $-N/2 < k \le [N/2]$. This last expression may be useful for computational
purposes.
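A discretized quadratic form of this kind, a sum of periodogram ordinates at the Fourier frequencies $2\pi k/N$ weighted by $f^{-1}$, is straightforward to evaluate. The sketch below is illustrative only: the function names, the $1/N$ scaling, and the mean-corrected periodogram $I_N(x) = |\sum_{j=1}^{N} e^{ijx}(X_j - \bar X_N)|^2/(2\pi N)$ are assumptions, since (1.4) and (1.5) are not reproduced in this section. The caller supplies `inv_f`, a function returning $f^{-1}(x, \theta)$ for a fixed $\theta$.

```python
import cmath
import math

def periodogram(xs, freq):
    """Assumed form: |sum_{j=1}^N e^{i j freq} (X_j - mean)|^2 / (2 pi N)."""
    n = len(xs)
    mean = sum(xs) / n
    s = sum(cmath.exp(1j * (j + 1) * freq) * (v - mean) for j, v in enumerate(xs))
    return abs(s) ** 2 / (2 * math.pi * n)

def whittle_sum(xs, inv_f):
    """(1/N) * sum over -N/2 < k <= floor(N/2) of inv_f(2 pi k/N) * I_N(2 pi k/N)."""
    n = len(xs)
    ks = [k for k in range(-(n // 2) - 1, n // 2 + 1) if -n / 2 < k <= n // 2]
    return sum(inv_f(2 * math.pi * k / n) * periodogram(xs, 2 * math.pi * k / n)
               for k in ks) / n

# With inv_f identically 1, Parseval's identity gives sum((x - mean)^2) / (2 pi N).
xs = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
value = whittle_sum(xs, lambda x: 1.0)
mean = sum(xs) / len(xs)
expected = sum((v - mean) ** 2 for v in xs) / (2 * math.pi * len(xs))
print(value, expected)
```

An estimate would then be obtained by minimizing `whittle_sum` over a grid or with a numerical optimizer, with `inv_f` rebuilt for each candidate parameter value; that step is omitted here.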


REMARK 2. $\hat\theta_N - \theta_0$ is asymptotically of the order of $1/\sqrt{N}$. Geweke and
Porter-Hudak (1983) have obtained asymptotic results for an estimator resulting
from a regression based on the periodogram estimates. This estimator converges
to the true value of the parameter at a slower speed than $1/\sqrt{N}$.


REMARK 3. The sample mean $\bar X_N$ converges to $\mu = EX_j$ at a slower speed
than $\hat\theta_N$ converges to $\theta_0$ because (up to a slowly varying function in the
normalization) $N^{1/2 - \alpha(\theta_0)/2}(\bar X_N - \mu)$ converges to a normal distribution [see Taqqu
(1975)].


APPLICATIONS. Theorems 1 and 2 can be applied to fractional Gaussian noise
and to fractional ARMA. In order to apply them to fractional Gaussian noise,
restrict the parameter $H$ to a compact subset of $(\tfrac12, 1)$ and choose the normalization
constant $CF(H)$ in (1.6) as

$$CF(H) = \exp\left\{-\frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left[(1 - \cos x) \sum_{k=-\infty}^{\infty} |x + 2k\pi|^{-1-2H}\right] dx\right\} \tag{2.1}$$

so that

$$\int_{-\pi}^{\pi} \log f(x, H)\,dx = 0.$$

Similarly, Theorems 1 and 2 can be applied to a fractional ARMA process by
restricting the parameter $(d, \xi, \phi)$ to a compact set and choosing $C$ in (1.9) as

$$C = C(d, \xi, \phi) = \exp\left\{-\frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left[|e^{ix} - 1|^{-2d} \left|\frac{g(e^{ix}, \xi)}{h(e^{ix}, \phi)}\right|^2\right] dx\right\}. \tag{2.2}$$
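The normalizing constant can be approximated by numerical integration. The sketch below treats only the simplest fractional ARMA case $g \equiv h \equiv 1$, an assumption made for illustration; the function names and the midpoint rule with `n` cells are likewise illustrative choices, not the authors' procedure. In this case $|e^{ix} - 1| = 2|\sin(x/2)|$ and the integral of $\log(2|\sin(x/2)|)$ over $[-\pi, \pi]$ is zero, so the constant should come out as 1 for every $d$.

```python
import math

def log_spec(x, d):
    """log of |e^{ix} - 1|^(-2d), using |e^{ix} - 1| = 2 |sin(x/2)|."""
    return -2 * d * math.log(2 * abs(math.sin(x / 2)))

def normalizing_constant(d, n=200000):
    """exp{-(1/(2 pi)) * integral over [-pi, pi] of log_spec}, by the midpoint rule.

    With n even, no midpoint falls on the logarithmic singularity at x = 0.
    """
    h = 2 * math.pi / n
    total = sum(log_spec(-math.pi + (i + 0.5) * h, d) for i in range(n)) * h
    return math.exp(-total / (2 * math.pi))

C = normalizing_constant(0.3)
print(C)
```

For a general rational factor $|g/h|^2$, one would add its logarithm inside `log_spec`; the singularity at the origin is still integrable, so the same quadrature applies.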




THEOREM 3. The conclusions of Theorems 1 and 2 hold if $X_j - \mu$ is fractional
Gaussian noise with $\tfrac12 < H < 1$ or a fractional ARMA process with
$0 < d < \tfrac12$.


Theorem 3 is proven in Section 4. 


REMARK. We have supposed that the mean $\mu$ of the sequence $X_j$ is unknown.
If it is known, merely replace the periodogram $I_N(x)$ in (1.5) by

$$\tilde I_N(x) = \frac{\left|\sum_{j=1}^{N} e^{ijx}(X_j - \mu)\right|^2}{2\pi N}.$$


COROLLARY 1. When $\mu = EX_j$ is known, Theorems 1, 2, and 3 hold if $I_N(x)$
is replaced by $\tilde I_N(x)$.


3. Proofs of Theorems 1 and 2. Retain the assumptions and definitions
made in Section 2 prior to the statement of Theorem 1. Introduce $r_k(\theta) = \int_{-\pi}^{\pi} e^{ikx} f(x, \theta)\,dx$,
so that $E(X_j - \mu)(X_{j+k} - \mu) = \sigma_0^2 r_k(\theta_0)$. Adopt the convention
that functions defined in $[-\pi, \pi]$ are extended to $[-2\pi, 2\pi]$ in such a way
as to have period $2\pi$.


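LEMMA 1. Let $g(x, \theta)$ be a continuous function on $[-\pi, \pi] \times E$. If (A.2) and
(A.4) hold, then with probability 1

$$\lim_{N \to \infty} \int_{-\pi}^{\pi} g(x, \theta) I_N(x)\,dx = \sigma_0^2 \int_{-\pi}^{\pi} g(x, \theta) f(x, \theta_0)\,dx$$

uniformly in $\theta$.


PROOF. Note that $I_N(x)$ has Fourier coefficients

$$\hat r_k = \int_{-\pi}^{\pi} e^{ikx} I_N(x)\,dx = \begin{cases} W(k, N), & |k| < N, \\ 0, & |k| \ge N, \end{cases}$$

where

$$W(k, N) = \frac{1}{N} \sum_{j=1}^{N-k} (X_j - \bar X_N)(X_{j+k} - \bar X_N)$$
$$= \frac{1}{N} \sum_{j=1}^{N-k} \left[(X_j - \mu) - (\bar X_N - \mu)\right]\left[(X_{j+k} - \mu) - (\bar X_N - \mu)\right]$$
$$= \frac{1}{N} \sum_{j=1}^{N-k} (X_j - \mu)(X_{j+k} - \mu) - (\bar X_N - \mu)\frac{1}{N} \sum_{j=1}^{N-k} (X_j - \mu) - (\bar X_N - \mu)\frac{1}{N} \sum_{j=1}^{N-k} (X_{j+k} - \mu) + \frac{N-k}{N}(\bar X_N - \mu)^2.$$

The sequence $\{X_j\}$ is ergodic since it is Gaussian with spectral density $f(x, \theta_0)$
that satisfies $\int_{-\pi}^{\pi} \log f(x, \theta_0)\,dx > -\infty$. Therefore $(\bar X_N - \mu)$ tends to 0 as $N \to \infty$,
as do the last three terms on the right-hand side. Hence

$$\lim_{N \to \infty} W(k, N) = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N-k} (X_j - \mu)(X_{j+k} - \mu) = \sigma_0^2 r_k(\theta_0).$$

This means that the proof of Lemma 1 can be carried out exactly as that of
Lemma 1 of Hannan (1973). □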


PROOF OF THEOREM 1. The proof uses Lemma 1 and the fact that
$\int_{-\pi}^{\pi} \log f(x, \theta)\,dx = 0$ for all $\theta$. Proceed as in the proof of Theorem 1 of Hannan
(1973). □


To establish Theorem 2, we use the following four lemmas. 


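LEMMA 2. Let $\{b_N\}$ be a sequence of constants tending to $\infty$. Let $\partial/\partial\theta\, \sigma_N^2(\theta)$
be the random vector with $j$th component equal to $\partial/\partial\theta_j\, \sigma_N^2(\theta)$. If (A.2)-(A.4)
hold and $Y$ is a random vector such that $b_N\, \partial/\partial\theta\, \sigma_N^2(\theta_0)$ tends to $Y$ in distribution
as $N \to \infty$, then $b_N(\hat\theta_N - \theta_0)$ tends to $(-2\pi/\sigma_0^2) W^{-1}(\theta_0) Y$ in distribution
as $N \to \infty$.


PROOF. Let $\partial^2/\partial\theta^2\, \sigma_N^2(\theta)$ be the $p \times p$ random matrix with $j, k$th entry
$\partial^2/\partial\theta_j\,\partial\theta_k\, \sigma_N^2(\theta)$. According to the mean value theorem

$$\frac{\partial}{\partial\theta} \sigma_N^2(\hat\theta_N) = \frac{\partial}{\partial\theta} \sigma_N^2(\theta_0) + \left[\frac{\partial^2}{\partial\theta^2} \sigma_N^2(\theta^*)\right](\hat\theta_N - \theta_0),$$

where $|\theta^* - \theta_0| \le |\hat\theta_N - \theta_0|$. Since $\theta_0$ is in the interior of $E$, Theorem 1 implies
that $\hat\theta_N$ is in the interior of $E$ for large $N$. Since $\hat\theta_N$ minimizes $\sigma_N^2(\theta)$, it follows
that $\partial/\partial\theta\, \sigma_N^2(\hat\theta_N) = 0$ for large $N$. Thus for large $N$

$$\frac{\partial}{\partial\theta} \sigma_N^2(\theta_0) = -\left[\frac{\partial^2}{\partial\theta^2} \sigma_N^2(\theta^*)\right](\hat\theta_N - \theta_0).$$

Because

$$\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} \sigma_N^2(\theta) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f^{-1}(x, \theta) I_N(x)\,dx,$$

it follows from Lemma 1 and Theorem 1 that with probability 1

$$\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} \sigma_N^2(\theta^*) \to \frac{\sigma_0^2}{2\pi} w_{jk}(\theta_0).$$

Therefore

$$-b_N \frac{\sigma_0^2}{2\pi} W(\theta_0)(\hat\theta_N - \theta_0)$$

tends in distribution to $Y$, completing the proof of Lemma 2. □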




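LEMMA 3. If (A.1), (A.2), and (A.3) hold then

$$w_{jk}(\theta) = \int_{-\pi}^{\pi} \left[\frac{\partial}{\partial\theta_j} f^{-1}(x, \theta)\right]\left[\frac{\partial}{\partial\theta_k} f^{-1}(x, \theta)\right] f^2(x, \theta)\,dx.$$


PROOF. If the right-hand side is denoted $J$, then by (A.1),

$$0 = \int_{-\pi}^{\pi} \frac{\partial^2}{\partial\theta_j\,\partial\theta_k} \log f(x, \theta)\,dx = \int_{-\pi}^{\pi} f^{-1}(x, \theta) \frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f(x, \theta)\,dx - J.$$

Hence

$$w_{jk}(\theta) = \int_{-\pi}^{\pi} f(x, \theta) \frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f^{-1}(x, \theta)\,dx = 2J - \int_{-\pi}^{\pi} f^{-1}(x, \theta) \frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f(x, \theta)\,dx = J. \qquad \Box$$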


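LEMMA 4. If conditions (A.2) and (A.4) hold, then for every $\delta > 0$

$$r_k(\theta) = O(k^{\alpha(\theta)-1+\delta}) \qquad \text{as } k \to \infty.$$


PROOF. Fix $\theta$ and put $f(x) = f(x, \theta)$. Since $f$ is periodic,

$$2|r_k(\theta)| = \left|\int_{-\pi}^{\pi} e^{ikx}\left[f(x) - f\!\left(x + \frac{\pi}{k}\right)\right] dx\right| \le \int_{-\pi}^{\pi} \left|f(x) - f\!\left(x + \frac{\pi}{k}\right)\right| dx$$
$$= \left(\int_{-\pi}^{-2\pi/k} + \int_{-2\pi/k}^{\pi/k} + \int_{\pi/k}^{\pi}\right) \left|f(x) - f\!\left(x + \frac{\pi}{k}\right)\right| dx.$$

Conditions (A.2) and (A.4) imply that there is a constant $C = C(\theta, \delta)$ such that

$$f(x) \le C|x|^{-\alpha(\theta)-\delta}$$

and

$$\left|\frac{\partial}{\partial x} f(x)\right| \le C|x|^{-\alpha(\theta)-1-\delta}$$

for $x$ bounded away from $\pm 2\pi$, say $|x| \le 2\pi - 1$. (Since $f$ is periodic, $f$ need not
be continuous at $\pm 2\pi$.) By the mean value theorem

$$\int_{-\pi}^{-2\pi/k} \left|f(x) - f\!\left(x + \frac{\pi}{k}\right)\right| dx \le \frac{\pi}{k} C \int_{-\pi}^{-2\pi/k} \left|x + \frac{\pi}{k}\right|^{-\alpha(\theta)-1-\delta} dx = \frac{\pi}{k} C \int_{-\pi+\pi/k}^{-\pi/k} |x|^{-\alpha(\theta)-1-\delta}\,dx = O(k^{\alpha(\theta)-1+\delta})$$

as $k \to \infty$. A similar argument shows that

$$\int_{\pi/k}^{\pi} \left|f(x) - f\!\left(x + \frac{\pi}{k}\right)\right| dx = O(k^{\alpha(\theta)-1+\delta}).$$

We also have

$$\int_{-2\pi/k}^{\pi/k} \left|f(x) - f\!\left(x + \frac{\pi}{k}\right)\right| dx \le C \int_{-2\pi/k}^{\pi/k} |x|^{-\alpha(\theta)-\delta}\,dx + C \int_{-2\pi/k}^{\pi/k} \left|x + \frac{\pi}{k}\right|^{-\alpha(\theta)-\delta} dx$$
$$\le 2C \int_{-2\pi/k}^{2\pi/k} |x|^{-\alpha(\theta)-\delta}\,dx = O(k^{\alpha(\theta)-1+\delta}).$$

This completes the proof of Lemma 4. □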


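LEMMA 5. If conditions (A.3), (A.5), and (A.6) hold, then for every $\delta > 0$
and every $1 \le j \le p$,

$$\int_{-\pi}^{\pi} e^{ikx} \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta)\,dx = O(k^{-\alpha(\theta)-1+\delta}) \qquad \text{as } k \to \infty.$$


PROOF. Since $\partial/\partial\theta_j\, f^{-1}(x, \theta)$ is symmetric, integration by parts yields

$$e_k(\theta) := \int_{-\pi}^{\pi} e^{ikx} \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta)\,dx = \frac{i}{k} \int_{-\pi}^{\pi} e^{ikx} \frac{\partial^2}{\partial x\,\partial\theta_j} f^{-1}(x, \theta)\,dx.$$

The argument in Lemma 4 can now be applied since

$$\frac{\partial^2}{\partial x\,\partial\theta_j} f^{-1}(x, \theta) = O(|x|^{\alpha(\theta)-1-\delta}) \qquad \text{with } 0 < -(\alpha(\theta) - 1) < 1.$$

We thus get

$$e_k(\theta) = \frac{i}{k}\, O(k^{-(\alpha(\theta)-1)-1+\delta}) = O(k^{-\alpha(\theta)-1+\delta}) \qquad \text{as } k \to \infty. \qquad \Box$$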


The proof of Theorem 2 uses the following result which is a consequence of 
Theorem 4 of Fox and Taqqu (1983). 


PROPOSITION 1. Let $f(x)$ and $g(x)$ be symmetric real-valued functions whose
sets of discontinuities have Lebesgue measure 0. Suppose that there exist $\alpha < 1$
and $\beta < 1$ such that $\alpha + \beta < \tfrac12$ and such that for each $\delta > 0$

$$f(x) = O(|x|^{-\alpha-\delta}) \qquad \text{as } x \to 0$$

and

$$g(x) = O(|x|^{-\beta-\delta}) \qquad \text{as } x \to 0.$$

If $\{X_j\}$ is a stationary, mean 0, Gaussian sequence with spectral density $f(x)$,
then

$$\sqrt{N}\left\{\int_{-\pi}^{\pi} I_N(x) g(x)\,dx - E\int_{-\pi}^{\pi} I_N(x) g(x)\,dx\right\}$$

tends in distribution to a normal random variable with mean 0 and variance

$$4\pi \int_{-\pi}^{\pi} [f(x) g(x)]^2\,dx.$$


REMARK. Proposition 1 is established by showing that the cumulants of
order greater than two tend to zero as $N \to \infty$. Non-Gaussian limits may occur if
the $\{X_j\}$ are non-Gaussian [see Fox and Taqqu (1985)].


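PROOF OF THEOREM 2. Let $\alpha_0$ denote $\alpha(\theta_0)$. Define $m_N = E\,\partial/\partial\theta\, \sigma_N^2(\theta_0)$ and
let $m_{N,j} = E\,\partial/\partial\theta_j\, \sigma_N^2(\theta_0)$ be the $j$th coordinate of $m_N$. Let $c_1, \ldots, c_p$ be fixed
constants and consider the random variable

$$Y_N = \sum_{j=1}^{p} c_j\left[\frac{\partial}{\partial\theta_j} \sigma_N^2(\theta_0) - m_{N,j}\right] = \frac{1}{2\pi} \int_{-\pi}^{\pi} I_N(x)\left[\sum_{j=1}^{p} c_j \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta_0)\right] dx - \sum_{j=1}^{p} c_j m_{N,j}.$$

Under condition (A.3) the function in brackets is $O(|x|^{\alpha_0-\delta})$ as $x \to 0$ for every
$\delta > 0$. Apply Proposition 1 with

$$\alpha = \alpha_0, \qquad \beta = -\alpha_0, \qquad f(x) = \sigma_0^2 f(x, \theta_0),$$

and

$$g(x) = \frac{1}{2\pi} \sum_{j=1}^{p} c_j \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta_0),$$

and conclude that $\sqrt{N}\, Y_N$ tends in distribution as $N \to \infty$ to a normal random
variable with mean 0 and variance $s^2$ given by

$$s^2 = 4\pi \int_{-\pi}^{\pi} \sigma_0^4 f^2(x, \theta_0)\left[\frac{1}{2\pi} \sum_{j=1}^{p} c_j \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta_0)\right]^2 dx = \frac{\sigma_0^4}{\pi} \int_{-\pi}^{\pi} f^2(x, \theta_0)\left[\sum_{j=1}^{p} c_j \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta_0)\right]^2 dx.$$

An application of Lemma 3 yields

$$s^2 = \sum_{j=1}^{p} \sum_{k=1}^{p} c_j c_k \frac{\sigma_0^4}{\pi} w_{jk}(\theta_0).$$

Since $c_1, \ldots, c_p$ were arbitrary, we have shown that $\sqrt{N}(\partial/\partial\theta\, \sigma_N^2(\theta_0) - m_N)$
tends in distribution to a normal random vector with mean 0 and covariance
matrix $(\sigma_0^4/\pi) W(\theta_0)$. Therefore Theorem 2 will follow from Lemma 2 if we show
that under the conditions of Theorem 2

$$\lim_{N \to \infty} \sqrt{N}\, m_{N,j} = 0, \qquad j = 1, \ldots, p.$$

To prove this, define

$$\mu_{N,j} = E\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta_0)\, \tilde I_N(x)\,dx,$$

where

$$\tilde I_N(x) = \frac{\left|\sum_{l=1}^{N} e^{ilx}(X_l - \mu)\right|^2}{2\pi N}.$$

It follows from Lemma 8.1 of Fox and Taqqu (1983) that $\lim_{N \to \infty} \sqrt{N}(m_{N,j} - \mu_{N,j}) = 0$.
Thus it suffices to show

$$\lim_{N \to \infty} \sqrt{N}\, \mu_{N,j} = 0, \qquad j = 1, \ldots, p. \tag{3.1}$$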
Noe i 
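We have

$$\mu_{N,j} = \frac{1}{(2\pi)^2 N} \sum_{l=1}^{N} \sum_{k=1}^{N} e_{l-k}(\theta_0)\, E(X_l - \mu)(X_k - \mu),$$

where

$$e_k = e_k(\theta_0) = \int_{-\pi}^{\pi} e^{ikx} \frac{\partial}{\partial\theta_j} f^{-1}(x, \theta_0)\,dx.$$

Set also $r_k = r_k(\theta_0)$. Then

$$\mu_{N,j} = \frac{\sigma_0^2}{(2\pi)^2 N} \sum_{l=1}^{N} \sum_{k=1}^{N} e_{l-k}\, r_{l-k}.$$

Note that $e_k r_k$ is the $k$th Fourier coefficient of the convolution

$$h(x) = \int_{-\pi}^{\pi} \frac{\partial}{\partial\theta_j} f^{-1}(y, \theta_0)\, f(x - y, \theta_0)\,dy.$$

Note also that

$$h(0) = \int_{-\pi}^{\pi} \left[\frac{\partial}{\partial\theta_j} f^{-1}(y, \theta_0)\right] f(y, \theta_0)\,dy = -\int_{-\pi}^{\pi} \frac{\partial}{\partial\theta_j} \log f(y, \theta_0)\,dy = 0,$$

where we have used (A.1).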


To prove (3.1), observe that by Lemmas 4 and 5, there is a $0 < \delta < \tfrac14$ such that
as $k \to \infty$,

$$e_k r_k = O(k^{-2+2\delta}). \tag{3.2}$$

Observe also that $e_k r_k$ is the $k$th Fourier coefficient of $h$, so that

$$\sum_{k=-\infty}^{\infty} e_k r_k = h(0) = 0. \tag{3.3}$$

We have

$$\frac{4\pi^2}{\sigma_0^2} \sqrt{N}\, \mu_{N,j} = \frac{1}{\sqrt{N}} \sum_{l=1}^{N} \sum_{k=1}^{N} e_{l-k}\, r_{l-k} = \sqrt{N} \sum_{|k| < N} e_k r_k\left(1 - \frac{|k|}{N}\right) = N^{1/2} \sum_{|k| < N} e_k r_k - N^{-1/2} \sum_{|k| < N} |k|\, e_k r_k.$$

Because of (3.3), the first term equals $-N^{1/2} \sum_{|k| \ge N} e_k r_k$, which is $O(N^{-1/2+2\delta})$
by (3.2). The second term is also $O(N^{-1/2+2\delta})$ by (3.2). These terms tend to zero
as $N \to \infty$, establishing (3.1). This completes the proof of Theorem 2. □


REMARK. Conditions (A.4), (A.5), and (A.6) were used in the proofs of
Lemmas 4 and 5 to show that $r_k = O(k^{\alpha(\theta)-1+\delta})$ and $e_k = O(k^{-\alpha(\theta)-1+\delta})$ as
$k \to \infty$. In specific cases, a Tauberian theorem may be applied to $f(x, \theta)$ and
$\partial/\partial\theta_j\, f^{-1}(x, \theta)$ to yield such estimates on $r_k$ and $e_k$.


4. Proof of Theorem 3. In order to verify that conditions A are satisfied for 
fractional Gaussian noise and fractional ARMA’s, it is convenient to check the 
following conditions which are stronger than conditions A. 


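CONDITIONS B. We say that $f(x, \theta)$ satisfies conditions B.1-B.4 if there is a
continuous function $0 < \alpha(\theta) < 1$ and constants $C(\delta)$, $C_1(\delta)$ such that for each
$\delta > 0$


(B.1) $f(x, \theta)$ is continuous at all $(x, \theta)$, $x \ne 0$, and

$$f(x, \theta) \ge C_1(\delta)|x|^{-\alpha(\theta)+\delta}.$$

(B.2) $f(x, \theta) \le C(\delta)|x|^{-\alpha(\theta)-\delta}$.

(B.3) $\partial/\partial\theta_j\, f(x, \theta)$ and $\partial^2/\partial\theta_j\,\partial\theta_k\, f(x, \theta)$ are continuous at all $(x, \theta)$, $x \ne 0$,

$$\left|\frac{\partial}{\partial\theta_j} f(x, \theta)\right| \le C(\delta)|x|^{-\alpha(\theta)-\delta}, \qquad 1 \le j \le p,$$

and

$$\left|\frac{\partial^2}{\partial\theta_j\,\partial\theta_k} f(x, \theta)\right| \le C(\delta)|x|^{-\alpha(\theta)-\delta}, \qquad 1 \le j, k \le p.$$

(B.4) $\partial/\partial x\, f(x, \theta)$, $\partial^2/\partial x\,\partial\theta_j\, f(x, \theta)$, and $\partial^3/\partial x^2\,\partial\theta_j\, f(x, \theta)$ are continuous at all
$(x, \theta)$, $x \ne 0$,

$$\left|\frac{\partial}{\partial x} f(x, \theta)\right| \le C(\delta)|x|^{-\alpha(\theta)-1-\delta},$$

$$\left|\frac{\partial^2}{\partial x\,\partial\theta_j} f(x, \theta)\right| \le C(\delta)|x|^{-\alpha(\theta)-1-\delta}, \qquad 1 \le j \le p,$$

and

$$\left|\frac{\partial^3}{\partial x^2\,\partial\theta_j} f(x, \theta)\right| \le C(\delta)|x|^{-\alpha(\theta)-2-\delta}, \qquad 1 \le j \le p.$$

Note that conditions B do not involve the function $f^{-1}(x, \theta)$. The constants
$C(\delta)$ and $C_1(\delta)$ which appear in conditions B are required to be independent of $\theta$.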


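LEMMA 6. If $f$ satisfies conditions B.1-B.4, then $f$ satisfies conditions A.1-A.6.


PROOF. Suppose that $f$ satisfies conditions B. It is easily seen that conditions
A.2-A.6 are satisfied. For example

$$\left|\frac{\partial}{\partial\theta_j} f^{-1}(x, \theta)\right| = \frac{\left|\frac{\partial}{\partial\theta_j} f(x, \theta)\right|}{f^2(x, \theta)} \le \frac{C(\delta)}{C_1^2(\delta)}\, |x|^{\alpha(\theta)-3\delta}.$$

This implies that $\partial/\partial\theta_j\, f^{-1}(x, \theta)$ is continuous and that $\partial/\partial\theta_j\, f^{-1}(x, \theta) = O(|x|^{\alpha(\theta)-3\delta})$ as $x \to 0$.

We check that condition A.1 is satisfied. Let $v_j$ be the $j$th unit vector in $R^p$,
that is, the vector with $j$th component equal to 1 and all other components equal
to 0. Then we have

$$\frac{\int_{-\pi}^{\pi} \log f(x, \theta + \varepsilon v_j)\,dx - \int_{-\pi}^{\pi} \log f(x, \theta)\,dx}{\varepsilon} = \int_{-\pi}^{\pi} \frac{\log f(x, \theta + \varepsilon v_j) - \log f(x, \theta)}{\varepsilon}\,dx.$$

By the mean value theorem this integrand is majorized for each $x \ne 0$ by

$$\left|\frac{\partial}{\partial\theta_j} \log f(x, \theta^*(x))\right| = \frac{\left|\frac{\partial}{\partial\theta_j} f(x, \theta^*(x))\right|}{f(x, \theta^*(x))},$$

where $|\theta^*(x) - \theta| \le |\varepsilon|$. Under conditions B.1 and B.3 this quotient is at most
$(C(\delta)/C_1(\delta))\,|x|^{\alpha_m - \alpha_M - 2\delta}$, where

$$\alpha_m = \min_{\theta} \alpha(\theta)$$

and

$$\alpha_M = \max_{\theta} \alpha(\theta).$$

Since $\alpha_m - \alpha_M > -1$ we can choose $\delta$ so that $\alpha_m - \alpha_M - 2\delta > -1$, and thus

$$\int_{-\pi}^{\pi} |x|^{\alpha_m - \alpha_M - 2\delta}\,dx < \infty.$$

Hence the dominated convergence theorem implies that $\int_{-\pi}^{\pi} \log f(x, \theta)\,dx$ can be
differentiated under the integral sign. A similar argument shows that a second
differentiation under the integral sign can also be performed. □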


PROOF OF THEOREM 3. For fractional Gaussian noise, $f(x, H) = CF(H) f_0(x, H)$,
where $f_0(x, H)$ is defined in (1.7) and $CF(H)$ is defined in (2.1) as

$$CF(H) = \exp\left\{-\frac{1}{2\pi} \int_{-\pi}^{\pi} \log f_0(x, H)\,dx\right\}.$$

According to Lemma 6 it suffices to show that $f(x, H)$ satisfies conditions
B.1-B.4 with $\alpha(H) = 2H - 1$. We will show that $f_0(x, H)$ satisfies conditions B
with $\alpha(H) = 2H - 1$. Then Lemma 6 implies that $CF(H)$ is twice continuously
differentiable, which means that $f(x, H)$ satisfies conditions B.

Note that

$$f_0(x, H) = (1 - \cos x)\left[|x|^{-1-2H} + f_1(x, H)\right],$$

where

$$f_1(x, H) = \sum_{k \ne 0} |x + 2k\pi|^{-1-2H}.$$

Since $1 - \cos x \sim |x|^2/2$ as $x \to 0$, conditions B will hold for $f_0(x, H)$ if $f_1(x, H)$
is three times continuously differentiable at all $(x, H)$. A standard theorem on
differentiation of series [Theorem 7.17 of Rudin (1964), for example] shows that
this is indeed the case. Thus conditions B are satisfied for fractional Gaussian
noise.

It is even simpler to verify conditions B for a fractional ARMA process
because the divergent term is already factored out in that case. □


Acknowledgments. We would like to thank Hans Kuensch for suggesting 
the proof of Lemma 5. We also thank the Associate Editor and the referees for 
suggestions which helped us improve the presentation of the paper. 


REFERENCES 


Basawa, I. V. and Prakasa Rao, B. L. S. (1980). Statistical Inference for Stochastic Processes. 
Academic, London. 

Basawa, I. V. and Scott, D. J. (1983). Asymptotic optimal inference for nonergodic models. 
Springer Lecture Notes in Statistics 17. Springer, New York. 

BERAN, J. and KUENSCH, H. (1985). Location estimators for processes with long range dependence. 
Preprint. 

BLEHER, P. M. (1981). Inversion of Toeplitz matrices. Trans. Moscow Math. Soc. 1981 (Issue 2) 
201-229. 




DUNSMUIR, W. and HANNAN, E. J. (1976). Vector linear time series models. Adv. in Appl. Probab. 8 
339-364. 

FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2. Wiley, New York. 

FOX, R. and TAQQU, M. S. (1983). Central limit theorems for quadratic forms in random variables
having long-range dependence. Technical Report 590, School of Operations Research and
Industrial Engineering, Cornell Univ.

FOX, R. and TAQQU, M. S. (1985). Noncentral limit theorems for quadratic forms in random variables
having long-range dependence. Ann. Probab. 13 428-446.

GEWEKE, J. and PORTER-HUDAK, S. (1983). The estimation and application of long memory time
series models. J. Time Ser. Anal. 4 221-238.

GRAF, H.-P. (1980). Long-range correlations and estimation of the self-similarity parameter. Ph.D. 
dissertation, Swiss Federal Institute of Technology. 

GRANGER, C. W. J. and JOYEUX, R. (1980). An introduction to long memory time series models and
fractional differencing. J. Time Ser. Anal. 1 15-30.

HANNAN, E. J. (1970). Multiple Time Series. Wiley, New York.

HANNAN, E. J. (1973). The asymptotic theory of linear time series models. J. Appl. Probab. 10
130-145.

HIPEL, K. and MCLEOD, A. (1978). Preservation of the rescaled adjusted range 1. Water Resources
Research 14 491-508.

HOSKING, J. R. M. (1981). Fractional differencing. Biometrika 68 165-176. 

MANDELBROT, B. B. (1975). Limit theorems for the self-normalized range. Z. Wahrsch. verw. 
Gebiete 31 271-285. 

MANDELBROT, B. B. and TAQQU, M. S. (1979). Robust R/S analysis of long run serial correlation.
Proceedings of the 42nd session of the International Statistical Institute, Manila. Bulletin
of the I.S.I. 48 (Book 2) 69-104.

MANDELBROT, B. B. and VAN NESS, J. W. (1968). Fractional Brownian motions, fractional noises and
applications. SIAM Rev. 10 422-437.

MOHR, D. (1981). Modeling data as a fractional Gaussian noise. Ph.D. dissertation, Princeton Univ.

RUDIN, W. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.

SINAI, YA. G. (1976). Self-similar probability distributions. Theory Probab. Appl. 21 64-80.

SWEETING, T. J. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Ann.
Statist. 8 1375-1381.

TAQQU, M. S. (1975). Weak convergence to fractional Brownian motion and to the Rosenblatt
process. Z. Wahrsch. verw. Gebiete 31 287-302.

TODINI, E. and O'CONNELL, P. E. (1979). Hydrological Simulation of Lake Nasser 1. Institute of
Hydrology, Wallingford, U.K. 

WALKER, A. M. (1964). Asymptotic properties of least squares estimators of parameters of the 
spectrum of a stationary non-deterministic time series. J. Austral. Math. Soc. 4 363-384. 

WHITTLE, P. (1951). Hypothesis Testing in Time Series Analysis. Hafner, New York.


SCHOOL OF OPERATIONS RESEARCH
CORNELL UNIVERSITY 
ITHACA, NEW YORK 14853 


The Annals of Statistics 
1986, Vol. 14, No. 2, 533-558


LIMIT THEORY FOR THE SAMPLE COVARIANCE AND 
CORRELATION FUNCTIONS OF MOVING AVERAGES 


By RICHARD DAVIS^1 AND SIDNEY RESNICK^2


Colorado State University 
Let $X_t = \sum_{j=-\infty}^{\infty} c_j Z_{t-j}$ be a moving average process where the $Z_t$'s are
iid and have regularly varying tail probabilities with index $\alpha > 0$. The limit
distribution of the sample covariance function is derived in the case that the
process has a finite variance but an infinite fourth moment. Furthermore, in
the infinite variance case ($0 < \alpha < 2$), the sample correlation function is
shown to converge in distribution to the ratio of two independent stable
random variables with indices $\alpha$ and $\alpha/2$, respectively. This result immediately
gives the limit distribution for the least squares estimates of the
parameters in an autoregressive process.


1. Introduction. We consider the discrete time moving average process

$$X_t = \sum_{j=-\infty}^{\infty} c_j Z_{t-j}, \tag{1.1}$$

where $\{Z_t, -\infty < t < \infty\}$ is an independent and identically distributed (iid)
sequence of random variables with regularly varying tail probabilities. More
specifically, we assume

$$P(|Z_1| > x) = x^{-\alpha} L(x) \tag{1.2}$$

with $\alpha > 0$ and $L(x)$ a slowly varying function at $\infty$, and

$$\frac{P(Z_1 > x)}{P(|Z_1| > x)} \to p, \qquad \frac{P(Z_1 < -x)}{P(|Z_1| > x)} \to q \tag{1.3}$$

as $x \to \infty$, $0 \le p \le 1$ and $q = 1 - p$. Under these assumptions on the noise
sequence, the series defined in (1.1) exists (cf. Cline, 1983) provided

$$\sum_{j=-\infty}^{\infty} |c_j|^{\delta} < \infty \qquad \text{for some } 0 < \delta < \alpha,\ \delta \le 1. \tag{1.4}$$

Note that any stationary ARMA process driven by the $\{Z_t\}$ sequence has such a
representation.
There has been increasing interest in modelling certain time series phenomena 
by an ARMA process with heavy tailed noise variables. For example, certain 


Received June 1984, revised June 1985. 

'Partially supported by grants NSF DMS 8202335 and AFOSR F49620 82 C 0009. Portions of the 
work were accomplished while visiting the Center for Stochastic Processes at the University of North 
Carolina. Their hospitality is gratefully acknowledged. 

? Partially supported by NSF grant DMS 8202335. 

AMS 1980 subject classifications. Primary 62M10; secondary 62E20, 60F05. 

Key words and phrases. Sample covariance and correlation functions, regular variation, stable 
laws, moving average, point processes. 


533 


534 R. DAVIS AND S RESNICK 


signal processes appear to be modelled better when the signal and/or noise has a
heavy-tailed distribution rather than a Gaussian distribution. Mertz (1965) and 
Stuck and Kleiner (1974) have demonstrated this for telephone signals, as has 
Evans (1969) for signals with ELF noise and Rybin (1978) for strong narrow band 
signals. Fama (1965) has similarly modelled stock market prices. Reeves (1969) 
has considered air turbulence and Safiullin and Chabdarov (1978) have investi- 
gated radio navigation with processes involving non-Gaussian noise. The ARMA 
model is usually the basis for such processes. 

In Davis and Resnick (1985), the weak limit behavior of the sample covariance
function for the $\{X_t\}$ sequence was derived in the $0 < \alpha < 2$ case. It then
followed immediately that the sample correlation function
$\hat\rho(h) = \sum_{t=1}^{n} X_t X_{t+h} / \sum_{t=1}^{n} X_t^2$, $h > 0$, converges in probability to the analogue of the
correlation function defined by $\rho(h) = \sum_{j=-\infty}^{\infty} c_j c_{j+h} / \sum_{j=-\infty}^{\infty} c_j^2$. A more refined
result for the sample correlation function from an AR($p$) process with errors
satisfying (1.2) and (1.3) was given by Kanter and Steiger (1974) and Hannan and
Kanter (1977). They proved that for any $\delta > \alpha$,

$$n^{1/\delta}(\hat\rho(h) - \rho(h)) \to_P 0$$

with a similar result holding for the least squares estimates of the parameters in
the AR($p$) model. Yohai and Maronna (1977) also considered AR($p$) processes
and showed that $n^{1/2}(\hat\rho(h) - \rho(h))$ is bounded in probability provided the $Z_t$'s
are symmetrically distributed and $E \log^+ |Z_1| < \infty$. We provide a much more
precise description of the limiting behavior of $\hat\rho(h)$ for infinite order moving
averages which includes as a special case the AR($p$) process considered by the
above authors. Of course if the $Z_t$'s have a finite variance then $n^{1/2}(\hat\rho(h) - \rho(h))$
is asymptotically normal under mild restrictions on the coefficients $\{c_j\}$ (cf.
Anderson, 1971, page 489).

In Section 2, the limit distribution of the sample covariance function is derived
for the case $2 \le \alpha < 4$. In the special case, $2 < \alpha < 4$, the process has a finite
variance but an infinite fourth moment. It turns out that, as in the $0 < \alpha < 2$
case, the limit behavior of the sample covariance function is determined by the
partial sums $\sum_{t=1}^{n} Z_t^2$. We also consider in Section 2 the situation when $Z_t^2$
belongs to the normal domain of attraction with an infinite variance.

The weak limit of the sample correlation function in the infinite variance case
($0 < \alpha < 2$) is considered in Section 4. It is shown that there exists a slowly
varying function at $\infty$, $L(\cdot)$, such that $n^{1/\alpha} L(n)(\hat\rho(h) - \rho(h))$ converges in
distribution to the ratio of two independent stable random variables with indices
$\alpha$ and $\alpha/2$, respectively. If the tail distribution of $|Z_1|$ is asymptotically equivalent
to a Pareto (as is the case when the $Z_t$'s have a stable distribution), then we
may take $L(n) = (\log n)^{-1/\alpha}$. Whereas the asymptotic properties of the sample
covariance function are governed by the partial sums $\sum_{t=1}^{n} Z_t^2$, the weak limit
behavior of the sample correlation function is determined by the vector of partial
sums $(\sum_{t=1}^{n} Z_t^2, \sum_{t=1}^{n} Z_t Z_{t+1}, \ldots, \sum_{t=1}^{n} Z_t Z_{t+h})$. In Section 3, we show that this
sequence of vector-valued random variables converges in distribution to a vector
$(S_0, S_1, \ldots, S_h)$ of independent nonnormal stable random variables. This result is
proved using point process techniques and ideas from extreme value theory. The


LIMIT THEORY OF MOVING AVERAGES 530 


limit random variables for the sample correlation function $\hat\rho(h)$ can then be
identified as

$$Y_h = \sum_{j=1}^{\infty} \left(\rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h)\right) S_j / S_0.$$

In the classical case ($\mathrm{var}(Z_t) < \infty$) the same result is true where the $S_j$'s, $j \ge 1$,
are iid $N(0, 1)$ rv's and $S_0 = 1$, and this provides an easy way to compute asymptotic
covariances of the $\hat\rho(h)$'s. Further discussion on this point is contained in
Section 4.
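In the classical finite variance case, with the $S_j$ iid $N(0, 1)$ and $S_0 = 1$, the representation of $Y_h$ gives $\mathrm{cov}(Y_h, Y_k) = \sum_{j \ge 1} w_{h,j} w_{k,j}$ with weights $w_{h,j} = \rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h)$. A minimal sketch of that computation follows; the function names, the truncation `max_lag`, and the MA(1) example with lag-one correlation 0.4 are assumptions chosen for illustration.

```python
def bartlett_weights(rho, h, max_lag):
    """w_{h,j} = rho(h + j) + rho(h - j) - 2 rho(j) rho(h), for j = 1..max_lag."""
    return [rho(h + j) + rho(h - j) - 2 * rho(j) * rho(h) for j in range(1, max_lag + 1)]

def asymptotic_cov(rho, h, k, max_lag=50):
    """Truncated sum over j of w_{h,j} * w_{k,j}; max_lag is an illustrative cutoff."""
    wh = bartlett_weights(rho, h, max_lag)
    wk = bartlett_weights(rho, k, max_lag)
    return sum(a * b for a, b in zip(wh, wk))

def ma1_rho(r):
    """Correlation function of an MA(1) process with lag-one correlation r."""
    return lambda lag: 1.0 if lag == 0 else (r if abs(lag) == 1 else 0.0)

rho = ma1_rho(0.4)
v11 = asymptotic_cov(rho, 1, 1)
print(v11)  # for MA(1) this equals 1 - 3 r^2 + 4 r^4
```

For a correlation function that does not vanish beyond a finite lag, `max_lag` would have to be chosen large enough for the neglected tail of the sum to be negligible.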

The limit results derived for the sample correlation function enable 6(h) to be 
used for model identification and estimation of parameters in the class of ARMA 
models. In particular, limit distributions for method-of-moments type estimators 
of the parameters in an ARMA process can be derived (some examples are 
considered in Section 5). These estimators will be weakly consistent regardless of 
the value of a. On the other hand, if more detailed information about the 
distribution of the residuals is known (for example the value of $\alpha$ in (1.2)) then
there may be better estimation methods such as minimizing the $\alpha$-dispersion
(Stuck, 1978, and Cline, 1983) or minimizing absolute deviations (Bloomfield and
Steiger, 1983). In the absence of knowledge about $\alpha$, one may fall back on the
following iterative procedure: (a) obtain preliminary estimates of the parameters
using the sample correlation function; (b) estimate $\alpha$ based on the resulting
estimated residuals from (a) (cf. Hall, 1982; DuMouchel, 1983); (c) update the
estimated parameters by minimizing the a-dispersion between the predicted and 
observed values. 


2. Sample covariance function. The aim of this section is to derive the
weak limit of the sample covariance function for the process $\{X_t\}$ satisfying (1.1)
with $2 \le \alpha < 4$. Assume

$$X_t = \sum_{j=-\infty}^{\infty} c_j Z_{t-j} \qquad \text{with} \qquad \sum_{j=-\infty}^{\infty} |c_j| < \infty, \tag{2.1}$$

where the $Z_t$ satisfy (1.2) and (1.3). Put $a_n = \inf\{x: P(|Z_1| > x) \le n^{-1}\}$ and
define the sample covariance function by

$$\hat\gamma(h) = n^{-1} \sum_{t=1}^{n} X_t X_{t+h}, \qquad h \ge 0.$$

The following proposition is the key step in evaluating the weak limit behavior of
$\hat\gamma(h)$.
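The two quantities just introduced are simple to compute. The sketch below is illustrative: the function names are hypothetical, the sample covariance is taken without mean correction as $n^{-1} \sum_{t=1}^{n} X_t X_{t+h}$, and the closed form for $a_n$ assumes an exact Pareto tail $P(|Z_1| > x) = x^{-\alpha}$ for $x \ge 1$, which is only one example of a regularly varying tail.

```python
def sample_cov(xs, h, n):
    """(1/n) * sum_{t=1}^{n} X_t X_{t+h}; requires len(xs) >= n + h."""
    return sum(xs[t] * xs[t + h] for t in range(n)) / n

def pareto_a_n(n, alpha):
    """a_n = inf{x: P(|Z| > x) <= 1/n} when P(|Z| > x) = x^(-alpha) for x >= 1."""
    return n ** (1.0 / alpha)

g1 = sample_cov([1.0, 2.0, 3.0, 4.0], 1, 3)   # (1*2 + 2*3 + 3*4) / 3
a16 = pareto_a_n(16, 4.0)                     # 16^(1/4)
# The normalization n * a_n^(-2) used for the sample covariance, for alpha = 3.
norm = 1000 * pareto_a_n(1000, 3.0) ** (-2)
print(g1, a16, norm)
```

For $2 < \alpha < 4$ the factor $n a_n^{-2}$ grows like $n^{1 - 2/\alpha}$, which is slower than the $n^{1/2}$ rate of the finite fourth moment case.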


PROPOSITION 2.1. If $2 \le \alpha < 4$ and $EZ_1 = 0$, then for every positive integer
$h$,

$$a_n^{-2}\left(n\hat\gamma(h) - \sum_{t=1}^{n} \sum_{i=-\infty}^{\infty} c_i c_{i+h} Z_{t-i}^2\right) \to_P 0. \tag{2.2}$$



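PROOF. We have

$$a_n^{-2}\left(\sum_{t=1}^{n} X_t X_{t+h} - \sum_{t=1}^{n} \sum_{i=-\infty}^{\infty} c_i c_{i+h} Z_{t-i}^2\right) = a_n^{-2} \sum_{t=1}^{n} \sum_{i \ne j} c_i c_{j+h} Z_{t-i} Z_{t-j}$$

$$= a_n^{-2} \sum_{t=1}^{n} \sum_{i \ne j} c_i c_{j+h}\left(Z_{t-i} 1_{[|Z_{t-i}| \le a_n]} - \mu_n\right)\left(Z_{t-j} 1_{[|Z_{t-j}| \le a_n]} - \mu_n\right)$$

$$+ a_n^{-2} \mu_n \sum_{t=1}^{n} \sum_{i \ne j} c_i c_{j+h}\left(Z_{t-i} 1_{[|Z_{t-i}| \le a_n]} + Z_{t-j} 1_{[|Z_{t-j}| \le a_n]}\right)$$

$$+ a_n^{-2} \sum_{t=1}^{n} \sum_{i \ne j} c_i c_{j+h} Z_{t-i} Z_{t-j} 1_{[|Z_{t-i}| > a_n \text{ or } |Z_{t-j}| > a_n]}$$

$$- n a_n^{-2} \mu_n^2 \sum_{i \ne j} c_i c_{j+h}$$

$$= A + B + C + D,$$

where $\mu_n = EZ_1 1_{[|Z_1| \le a_n]}$. We shall show that $A, B, C \to_P 0$ and $D \to 0$.
Define

$$Z_{t,n} = Z_t 1_{[|Z_t| \le a_n]} - \mu_n$$

and we have

$$\mathrm{var}(A) = a_n^{-4} \sum_{s=1}^{n} \sum_{t=1}^{n} \sum_{i \ne j} \sum_{k \ne l} c_i c_{j+h} c_k c_{l+h}\, E\left(Z_{t-i,n} Z_{t-j,n} Z_{s-k,n} Z_{s-l,n}\right).$$

Since $\{Z_{t,n}, -\infty < t < \infty\}$ is for each $n$ an iid sequence of zero mean random
variables, the above expectation is zero unless $\{t-i, t-j\} = \{s-k, s-l\}$.
When this is the case, the expectation is of the form

$$EZ_{1,n}^2 Z_{2,n}^2 = EZ_{1,n}^2\, EZ_{2,n}^2 \le \left(EZ_1^2 1_{[|Z_1| \le a_n]}\right)^2 = \sigma_n^4,$$

where $\sigma_n^2 = EZ_1^2 1_{[|Z_1| \le a_n]}$. Hence

$$\mathrm{var}(A) \le a_n^{-4} \sigma_n^4 \sum_{s=1}^{n} \sum_{t=1}^{n} \sum_{i \ne j} \left(|c_i c_{j+h} c_{i+s-t} c_{j+s-t+h}| + |c_i c_{j+h} c_{j+s-t} c_{i+s-t+h}|\right) \le 2\sigma_n^4 \left(\sum_i |c_i|\right)^4 n a_n^{-4}.$$

For $2 < \alpha < 4$, $\sigma_n^2$ has a finite limit and in the $\alpha = 2$ case it is slowly varying by
Karamata's theorem (Feller, 1971). So in either case $\sigma_n^4$ is slowly varying.
Moreover, $a_n$ is regularly varying with index $1/\alpha$, which together with the slow
variation of $\sigma_n^4$ implies $n a_n^{-4} \sigma_n^4 \to 0$ as $n \to \infty$. Thus, $\mathrm{var}(A) \to 0$ as desired.
As for the term $B$, we have

$$E|B| \le 2 n a_n^{-2} |\mu_n| \left(\sum_i |c_i|\right)^2 E|Z_1| 1_{[|Z_1| \le a_n]} \le 2\left(\sum_i |c_i|\right)^2 E|Z_1|\; n a_n^{-2} |\mu_n|.$$

Since $EZ_1 = 0$ by assumption,

$$|\mu_n| = \left|EZ_1 1_{[|Z_1| > a_n]}\right| \le E|Z_1| 1_{[|Z_1| > a_n]} \sim \frac{\alpha}{\alpha - 1}\, a_n n^{-1}$$

by Karamata's theorem. Hence $n a_n^{-2} |\mu_n| \to 0$ as $n \to \infty$.
Next

$$E|C| \le n a_n^{-2} \left(\sum_i |c_i|\right)^2 E|Z_1 Z_2| 1_{[|Z_1| > a_n \text{ or } |Z_2| > a_n]} \le 2 n a_n^{-2} \left(\sum_i |c_i|\right)^2 E|Z_1|\, E|Z_2| 1_{[|Z_2| > a_n]} \to 0$$

by Karamata's theorem as for $B$. Finally, $D = O(n a_n^{-2} \mu_n^2) \to 0$ since for $B$ we
have already proved $n a_n^{-2} |\mu_n| \to 0$ and this completes the proof. □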


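For $\alpha > 2$, define

$$\gamma(h) = \mathrm{cov}(X_t, X_{t+h}) = \sigma^2 \sum_{j=-\infty}^{\infty} c_j c_{j+h},$$

where $\sigma^2 = \mathrm{var}(Z_1)$. The next theorem gives the main result of this section. Here
and in what follows, convergence in distribution is denoted by "$\Rightarrow$."


THEOREM 2.2. Suppose $\{X_t\}$ is given by (2.1) where $\{Z_t\}$ satisfies (1.2) and
(1.3) with $2 \le \alpha < 4$. If $EZ_1 = 0$, then for any positive integer $l$

$$\left(n a_n^{-2}(\hat\gamma(h) - b_{h,n}),\ 0 \le h \le l\right) \Rightarrow S \cdot \left(\sum_j c_j^2,\ \sum_j c_j c_{j+1},\ \ldots,\ \sum_j c_j c_{j+l}\right), \tag{2.3}$$

where $S$ is a stable random variable with index $\alpha/2$ and $b_{h,n} = \sum_{j=-\infty}^{\infty} c_j c_{j+h}\, EZ_1^2 1_{[|Z_1| \le a_n]}$, $0 \le h \le l$. Moreover, if $2 < \alpha < 4$, then

$$\left(n a_n^{-2}(\hat\gamma(h) - \gamma(h)),\ 0 \le h \le l\right) \Rightarrow \left(S - \frac{\alpha}{\alpha - 2}\right)(\gamma(0), \ldots, \gamma(l))/\sigma^2. \tag{2.4}$$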




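PROOF. By Theorem 4.1 in Davis and Resnick (1984),

$$a_n^{-2} \sum_{t=1}^{n} \sum_{i=-\infty}^{\infty} c_i c_{i+h}\left(Z_{t-i}^2 - \sigma_n^2\right) \Rightarrow S \sum_{i=-\infty}^{\infty} c_i c_{i+h} \qquad \text{for all } h \ge 0,$$

where $\sigma_n^2 = EZ_1^2 1_{[|Z_1| \le a_n]}$ and $S$ is a stable random variable with index $\alpha/2$.
From the proof of this same theorem, we have for any positive integer $l$

$$a_n^{-2}\left(\sum_{t=1}^{n} \sum_i c_i^2 (Z_{t-i}^2 - \sigma_n^2),\ \sum_{t=1}^{n} \sum_i c_i c_{i+1} (Z_{t-i}^2 - \sigma_n^2),\ \ldots,\ \sum_{t=1}^{n} \sum_i c_i c_{i+l} (Z_{t-i}^2 - \sigma_n^2)\right)$$
$$\Rightarrow S\left(\sum_i c_i^2,\ \sum_i c_i c_{i+1},\ \ldots,\ \sum_i c_i c_{i+l}\right).$$

This combined with Proposition 2.1 proves (2.3).
If $\alpha > 2$, then $\sigma_n^2 \to \sigma^2$ and by Karamata's theorem,

$$n\sigma^2 a_n^{-2} - n\sigma_n^2 a_n^{-2} = n a_n^{-2} EZ_1^2 1_{[|Z_1| > a_n]} \to \frac{\alpha}{\alpha - 2},$$

so that by the convergence of types result, (2.4) holds. □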


COROLLARY. The same limit law is attained in Theorem 2.2 if $\hat\gamma(h)$ is
replaced by a mean corrected version

$$\tilde\gamma(h) = \frac{1}{n} \sum_{t=1}^{n-h} (X_t - \bar X)(X_{t+h} - \bar X), \qquad \text{where } \bar X = \frac{1}{n} \sum_{t=1}^{n} X_t.$$


The proof of this corollary is analogous to that of the corollary following
Theorem 4.2 in Davis and Resnick (1985) and is therefore omitted. Also note that
the corollary remains true if $EZ_1 \ne 0$ by considering the process
$X_t - EX_t = \sum_{j=-\infty}^{\infty} c_j (Z_{t-j} - EZ_{t-j})$.


Corresponding to the case a = 4 we have the following result. 


PROPOSITION 2.3. Suppose $\{X_t\}$ is defined by (2.1) with $EZ_1 = 0$ and

$$EZ_1^4 1_{[|Z_1| \le t]} =: L(t)$$

is slowly varying with $\lim_{t \to \infty} L(t) = \infty$. Define $a_n$ by

$$\frac{nL(a_n)}{a_n^4} \to 1,$$

so that a,, is regularly varying with index 4. If a, = œ then in R"! 
$$\left(n a_n^{-2}(\hat\gamma(h) - \gamma(h)),\ 0 \le h \le l\right) \Rightarrow N \cdot (\gamma(0), \ldots, \gamma(l))/\sigma^2, \tag{2.5}$$
where $N$ is a $N(0, 1)$ random variable.
REMARKS. (1) Define $L_1(x) = L(x^{1/2})$ so that $L_1$ is also slowly varying (de
Haan, 1970, page 21). Then $a_n^2$ must satisfy $nL_1(a_n^2)/a_n^4 \to 1$. Set $U_1(x) = x^2/L_1(x)$
so that $U_1$ is regularly varying with index 2 and $a_n^2$ satisfies $U_1(a_n^2) \sim n$,
and this shows $a_n^2$ may be taken as the asymptotic inverse of $U_1$ at the point $n$
(cf. Seneta, 1976, page 21).
(2) For the classical result assuming $EZ_1^4 < \infty$, see Anderson (1971, page 478).


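PROOF. We begin by showing the analogue of Proposition 2.1. The difference

$a_n^{-2}\Bigl(n\hat\gamma(h) - \sum_{t=1}^{n}\sum_i c_i c_{i+h} Z_{t-i}^2\Bigr)$

is again decomposed into the pieces $A + B + C + D$.
We have $\operatorname{var}(A) = O(na_n^{-4})$. Since $L(t) \to \infty$ we have $a_n^2/\sqrt{n} \to \infty$ and hence

$na_n^{-4} = (a_n^2/\sqrt{n})^{-2} \to 0,$

as desired. For $B$ we have

$E|B| \le (\text{const})\, n a_n^{-2} E|Z_1| 1_{[|Z_1| > a_n]}.$

Since $L(t) = \int_0^t z^4 P[|Z_1| \in dz]$ we have

$E|Z_1| 1_{[|Z_1| > a_n]} = \int_{a_n}^{\infty} z P[|Z_1| \in dz] = \int_{a_n}^{\infty} t^{-3} L(dt)$
$\qquad = 3\int_{a_n}^{\infty} L(s)s^{-4}\, ds - L(a_n)a_n^{-3}$
$\qquad = a_n^{-3}L(a_n)\Bigl\{\int_1^{\infty} 3(L(a_n s)/L(a_n))s^{-4}\, ds - 1\Bigr\},$

so that

$na_n^{-2}E|Z_1| 1_{[|Z_1| > a_n]} = nL(a_n)a_n^{-5}\Bigl\{\int_1^{\infty} 3(L(a_n s)/L(a_n))s^{-4}\, ds - 1\Bigr\}.$

However, since $nL(a_n)a_n^{-4} \to 1$, the above term is asymptotic to

$a_n^{-1}\Bigl\{\int_1^{\infty} 3(L(a_n s)/L(a_n))s^{-4}\, ds - 1\Bigr\},$

which goes to zero since $a_n \to \infty$ and the expression within the braces goes to zero by Karamata's theorem. The term $E|C|$ is handled in the same way and $D$ is of smaller order than $E|B|$ so the analogue of Proposition 2.1 is proved. □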


Before continuing with the proof we need the following result. 


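PROPOSITION 2.4. Suppose $\{X_t\}$ satisfies (2.1) with $EZ_1 = 0$ and $U(t) = EZ_1^2 1_{[|Z_1| \le t]}$ slowly varying. Define $g_n$ by

$n g_n^{-2} EZ_1^2 1_{[|Z_1| \le g_n]} \to 1.$

Then

$g_n^{-1}\sum_{t=1}^{n} X_t \Rightarrow \Bigl(\sum_j c_j\Bigr) N,$

where $N$ is $N(0,1)$.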




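PROOF. A proof can be fashioned after the method used in Davis and Resnick (1985) to prove Theorem 4.1. We have $Z_1$ in the domain of attraction of the normal so that $g_n^{-1}\sum_{t=1}^{n} Z_t \Rightarrow N$. Furthermore for $m \ge 1$

$Y_n := \Bigl(g_n^{-1}\sum_{t=1}^{n} Z_{t-j},\ |j| \le m\Bigr) \Rightarrow (N, N, \dots, N)$

in $\mathbb{R}^{2m+1}$ and therefore by the continuous mapping theorem

$g_n^{-1}\sum_{t=1}^{n}\sum_{|j| \le m} c_j Z_{t-j} = (c_{-m}, \dots, c_m)\cdot Y_n \Rightarrow \Bigl(\sum_{|j| \le m} c_j\Bigr) N.$

It remains to show

(2.6)  $\lim_{m\to\infty}\limsup_{n\to\infty} P\Bigl[\Bigl|g_n^{-1}\sum_{t=1}^{n} X_t - (c_{-m}, \dots, c_m)\cdot Y_n\Bigr| > \delta\Bigr] = 0$

for any $\delta > 0$ as well as

(2.7)  $\Bigl(\sum_{|j| \le m} c_j\Bigr) N \Rightarrow \Bigl(\sum_j c_j\Bigr) N$  as $m \to \infty$.

The validity of (2.7) is obvious.
We have that

$g_n^{-1}\sum_{t=1}^{n} X_t - (c_{-m}, \dots, c_m)\cdot Y_n = g_n^{-1}\sum_{t=1}^{n}\sum_{|j| > m} c_j Z_{t-j}$
$\qquad = g_n^{-1}\sum_{t=1}^{n}\sum_{|j| > m} c_j\bigl(Z_{t-j}1_{[|Z_{t-j}| \le g_n]} - EZ_1 1_{[|Z_1| \le g_n]}\bigr)$
$\qquad\quad + g_n^{-1} n \sum_{|j| > m} c_j\, EZ_1 1_{[|Z_1| \le g_n]}$
$\qquad\quad + g_n^{-1}\sum_{t=1}^{n}\sum_{|j| > m} c_j Z_{t-j} 1_{[|Z_{t-j}| > g_n]}$
$\qquad = \alpha + \beta + \gamma.$

Now

$\lim_{m\to\infty}\limsup_{n\to\infty} P[|\alpha| > \delta] = 0$

by an argument identical to one used in the proof of Theorem 4.1 of Davis and Resnick (1985). (We use the fact that $ng_n^{-2}EZ_1^2 1_{[|Z_1| \le g_n]} \to 1$.) For the other two terms we calculate

$|EZ_1 1_{[|Z_1| \le g_n]}| = |EZ_1 1_{[|Z_1| > g_n]}| \le E|Z_1| 1_{[|Z_1| > g_n]} = \int_{g_n}^{\infty} t P[|Z_1| \in dt] = \int_{g_n}^{\infty} t^{-1} U(dt)$
$\qquad = \int_{g_n}^{\infty} s^{-2}U(s)\, ds - g_n^{-1}U(g_n),$

and so applying Karamata's theorem (recall $U$ is slowly varying) we get

$\lim_{n\to\infty} \frac{g_n E|Z_1| 1_{[|Z_1| > g_n]}}{U(g_n)} = 0.$

Thus

$|\beta| \le g_n^{-1} n\, |EZ_1 1_{[|Z_1| \le g_n]}| \sum_{|j| > m} |c_j|$
$\qquad = \frac{nU(g_n)}{g_n^2}\cdot\frac{g_n |EZ_1 1_{[|Z_1| \le g_n]}|}{U(g_n)} \sum_{|j| > m} |c_j|$
$\qquad \sim g_n |EZ_1 1_{[|Z_1| \le g_n]}| \sum_{|j| > m} |c_j| \big/ U(g_n)$
$\qquad \to 0$  as $n \to \infty$.

Likewise

$\lim_{m\to\infty}\limsup_{n\to\infty} P[|\gamma| > \delta] \le \lim_{m\to\infty}\limsup_{n\to\infty} \delta^{-1} g_n^{-1} n\, E|Z_1| 1_{[|Z_1| > g_n]} \sum_{|j| > m} |c_j| = 0,$

as desired for the verification of (2.6). □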


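CONTINUATION OF THE PROOF OF PROPOSITION 2.3. From Proposition 2.4 we have (recall $\sigma^2 = EZ_1^2 = \operatorname{var}(Z_1)$)

$a_n^{-2}\sum_{t=1}^{n}\sum_{i=-\infty}^{\infty} c_i c_{i+h}(Z_{t-i}^2 - \sigma^2) \Rightarrow \Bigl(\sum_{i=-\infty}^{\infty} c_i c_{i+h}\Bigr) N$

and hence, from the analogue of Proposition 2.1,

$na_n^{-2}(\hat\gamma(h) - \gamma(h)) \Rightarrow \Bigl(\sum_i c_i c_{i+h}\Bigr) N.$

The assertion of Proposition 2.3 easily follows.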


REMARK. The same limit law holds if $\hat\gamma(h)$ is replaced by the mean corrected version:

$\tilde\gamma(h) = \frac{1}{n}\sum_{t=1}^{n-h} (X_t - \bar X)(X_{t+h} - \bar X).$

3. Sample covariance function of $\{Z_t\}$. Assume $\{Z_t\}$ is iid and satisfies (1.2) and (1.3) with $0 < \alpha < 2$. As before define

(3.1)  $a_n = \inf\{x: P(|Z_1| > x) \le n^{-1}\}.$

Applying Theorem 4.2 in Davis and Resnick (1985) to the $Z_t$ sequence (i.e., take $c_j = 0$, $j \ne 0$ and $c_0 = 1$), we obtain

$a_n^{-2}\sum_{t=1}^{n} Z_t Z_{t+h} \Rightarrow S \cdot 0 = 0$  for all $h > 0$

and

$a_n^{-2}\sum_{t=1}^{n} Z_t^2 \Rightarrow S,$

where $S$ is a positive stable random variable with index $\alpha/2$. In this section, we give a different normalization for the partial sums $\sum_{t=1}^{n} Z_t Z_{t+h}$, $h > 0$, in order to get a nondegenerate weak limit. Not surprisingly, these partial sums (i.e., sample covariances) at different lags turn out to be asymptotically independent. This will be the main building block for deriving the limit distribution of the sample correlation function of the $X_t$ process in the next section.
Throughout this section we shall assume $E|Z_1|^\alpha = \infty$. It then follows from Theorem 3.3(iv) in Cline (1983) that the product $Z_1Z_2$ belongs to the $\alpha$-domain of attraction. That is, $Z_1Z_2$ satisfies

(3.2)  $\dfrac{P(|Z_1Z_2| > tx)}{P(|Z_1Z_2| > t)} \to x^{-\alpha}$  as $t \to \infty$, $x > 0$,

and

(3.3)  $\dfrac{P(Z_1Z_2 > t)}{P(|Z_1Z_2| > t)} \to p^2 + (1 - p)^2$  as $t \to \infty$,

where $p$ is given in (1.3).
Define

(3.4)  $\tilde a_n = \inf\{x: P(|Z_1Z_2| > x) \le n^{-1}\}.$

We first show that

(3.5)  $\tilde a_n/a_n \to \infty.$

Observe that for a fixed positive number $M$,

$\dfrac{P(|Z_1Z_2| > t)}{P(|Z_1| > t)} \ge \dfrac{P(|Z_2| > t/|Z_1|,\ |Z_1| \le M)}{P(|Z_2| > t)} = \int_{|y| \le M} \dfrac{P(|Z_2| > t/|y|)}{P(|Z_2| > t)}\, P(Z_1 \in dy).$

We then have, by Fatou's lemma and (1.2),

$\liminf_{t\to\infty} \dfrac{P(|Z_1Z_2| > t)}{P(|Z_1| > t)} \ge \int_0^M y^\alpha P(|Z_1| \in dy)$

and upon letting $M \to \infty$, the lower bound converges to $E|Z_1|^\alpha = \infty$. It now is easy to check that (3.5) must hold.
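
As a concrete illustration of (3.1), (3.4) and (3.5), the following Python sketch assumes that $|Z_1|$ is exactly Pareto, $P(|Z_1| > x) = x^{-\alpha}$ for $x \ge 1$. This is an assumption made here for illustration only, not a condition of the paper. Under it, $\log|Z_1|$ is exponential, so $P(|Z_1Z_2| > x) = x^{-\alpha}(1 + \alpha\log x)$ for $x \ge 1$ and both quantiles can be computed. The function names are hypothetical.

```python
import math

def a_n(n, alpha):
    """Quantile (3.1) for an exact Pareto tail P(|Z| > x) = x**(-alpha), x >= 1."""
    return n ** (1.0 / alpha)

def product_tail(x, alpha):
    """P(|Z1*Z2| > x) for independent exact Pareto(alpha) magnitudes, x >= 1."""
    return x ** (-alpha) * (1.0 + alpha * math.log(x))

def a_tilde_n(n, alpha):
    """Quantile (3.4): solve product_tail(x) = 1/n by bisection on u = log x."""
    def f(u):
        # log of n * P(|Z1 Z2| > e**u); strictly decreasing for u > 0
        return math.log(n) - alpha * u + math.log1p(alpha * u)
    lo, hi = 0.0, 1.0
    while f(hi) > 0.0:          # expand the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(200):        # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

if __name__ == "__main__":
    alpha = 1.5
    for n in (10**2, 10**4, 10**6):
        print(n, a_tilde_n(n, alpha) / a_n(n, alpha))
```

In this Pareto case the ratio $\tilde a_n/a_n$ equals $(1 + \alpha\log\tilde a_n)^{1/\alpha}$, so it grows without bound but only at a logarithmic rate.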

The joint asymptotic behavior of the partial sums $(\sum_{t=1}^{n} Z_t^2,\ \sum_{t=1}^{n} Z_tZ_{t+1},\ \dots,\ \sum_{t=1}^{n} Z_tZ_{t+h})$ is handled using point process techniques. For background on point processes, see Kallenberg (1976). Set $Y_t = (Z_t, Z_tZ_{t+1}, \dots, Z_tZ_{t+h})$ for $t = 0, \pm 1, \pm 2, \dots$ and define $\mathbf{a}_n^{-1}Y_t = (a_n^{-1}Z_t,\ \tilde a_n^{-1}Z_tZ_{t+1},\ \dots,\ \tilde a_n^{-1}Z_tZ_{t+h})$. The relevant sequence of point processes for this problem is given by

$I_n = \sum_{t=1}^{n} \varepsilon_{\mathbf{a}_n^{-1}Y_t},$

which is defined on the state space $E = \mathbb{R}^{h+1} \setminus \{(0, 0, \dots, 0)\}$, where $\varepsilon_x$ is the measure assigning unit mass to the point $x$ and zero elsewhere. In defining a point process on $E$, we shall use the convention that if a point falls outside the state space it does not contribute to the sum. $\mathcal{E}$ will denote the usual product $\sigma$-algebra on $E$ modified so that the compact subsets of $E$ are those compact sets in $\mathbb{R}^{h+1}$ which are bounded away from $(0, 0, \dots, 0)$.

It will be shown that the sequence $\{I_n\}$ converges in distribution to a Poisson process defined as follows: Let

$\sum_{k=1}^{\infty} \varepsilon_{j_k^{(1)}},\ \dots,\ \sum_{k=1}^{\infty} \varepsilon_{j_k^{(h)}}$

be $h$ iid Poisson processes on $\mathbb{R} \setminus \{0\}$ with intensity measure given by

$\tilde\lambda(dx) = \alpha\tilde p\, x^{-\alpha-1} 1_{(0,\infty)}(x)\, dx + \alpha\tilde q\, (-x)^{-\alpha-1} 1_{(-\infty,0)}(x)\, dx,$

where $\tilde p = p^2 + (1 - p)^2$ and $\tilde q = 1 - \tilde p$. Further let $\sum_{k=1}^{\infty} \varepsilon_{j_k^{(0)}}$ also be a Poisson process on $\mathbb{R} \setminus \{0\}$ independent of the $h$ processes above with intensity $\lambda(dx) = \alpha p\, x^{-\alpha-1} 1_{(0,\infty)}(x)\, dx + \alpha q\, (-x)^{-\alpha-1} 1_{(-\infty,0)}(x)\, dx$. The limit point process is then

$I = \sum_{k=1}^{\infty}\sum_{i=0}^{h} \varepsilon_{j_k^{(i)} e_i},$

where $e_i \in \mathbb{R}^{h+1}$ is the basis element with $i$th component equal to one and the rest zero. In other words, the points of $I$ are located on the coordinate axes, the points $\{j_k^{(i)},\ k = 1, 2, \dots\}$ lying on the axis determined by $e_i$.
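
The points of a Poisson process with an intensity of the form $\lambda(dx)$ can be simulated through the standard series representation: if $\Gamma_1 < \Gamma_2 < \cdots$ are the arrival times of a unit-rate Poisson process, then $\Gamma_k^{-1/\alpha}$ are the points of a Poisson process on $(0, \infty)$ with intensity $\alpha x^{-\alpha-1}\,dx$, and independent random signs (positive with probability $p$) split them between the two half-lines. The sketch below illustrates that representation, truncated at a hypothetical number `K` of points; it is not part of the paper's argument, and the function name is an assumption.

```python
import random

def poisson_points(alpha, p, K, rng):
    """First K points (largest magnitude first) of a Poisson process on the
    nonzero reals with intensity alpha*p*x**(-alpha-1) dx on (0, inf) and
    alpha*(1-p)*(-x)**(-alpha-1) dx on (-inf, 0)."""
    points, gamma = [], 0.0
    for _ in range(K):
        gamma += rng.expovariate(1.0)         # next arrival of a unit-rate Poisson process
        magnitude = gamma ** (-1.0 / alpha)   # magnitudes decrease as gamma increases
        sign = 1.0 if rng.random() < p else -1.0
        points.append(sign * magnitude)
    return points

if __name__ == "__main__":
    rng = random.Random(0)
    j0 = poisson_points(alpha=1.5, p=0.5, K=1000, rng=rng)
    s0_truncated = sum(x * x for x in j0)     # truncated version of sum_k (j_k)^2
    print(j0[:3], s0_truncated)
```

Calling the function $h + 1$ times with independent generators, once with $(p, q)$ and $h$ times with $(\tilde p, \tilde q)$, gives truncated versions of the $h + 1$ axis processes that make up $I$.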

In order to establish $I_n \Rightarrow I$ it is convenient to first specify a class of sets (as in Section 2 of Davis and Resnick, 1985) which generate $\mathcal{E}$. Let $\mathcal{S}$ be the collection of all sets $B$ of the form

$B = (b_0, c_0] \times (b_1, c_1] \times \cdots \times (b_h, c_h],$

which are bounded away from $(0, 0, \dots, 0)$ and $b_r < c_r$, $b_r \ne 0$, $c_r \ne 0$ for $r = 0, 1, \dots, h$. It is clear that $\mathcal{S}$ is a DC-semiring (cf. Kallenberg, 1976, page 3). Moreover, since $B \in \mathcal{S}$ is bounded away from zero, either

(C1)  $B \cap \{ye_i: y \in \mathbb{R}\} = \emptyset$  for $i = 0, \dots, h$,

or

(C2)  $B \cap \{ye_i: y \in \mathbb{R}\} = (b_j, c_j]$ if $i = j$, and $= \emptyset$ if $i \ne j$.

That is, $B$ has either empty intersection with all of the coordinate axes or intersects exactly one in an interval. Note that in (C2), $b_i < 0 < c_i$ for $i \ne j$ and $0 \notin (b_j, c_j]$. Further properties of these sets are developed in the following proposition.




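PROPOSITION 3.1.

(i) $nP(\mathbf{a}_n^{-1}Y_1 \in B) \to 0$ if $B \in \mathcal{S}$ satisfies C1.
(ii) $nP(\mathbf{a}_n^{-1}Y_1 \in B) \to \lambda(b_0, c_0]$ if $B \in \mathcal{S}$ satisfies C2 with $j = 0$, and $\to \tilde\lambda(b_j, c_j]$ if $B \in \mathcal{S}$ satisfies C2 with $j \ne 0$.
(iii) $nP(\mathbf{a}_n^{-1}Y_1 \in B_1,\ \mathbf{a}_n^{-1}Y_t \in B_2) \to 0$ if $B_1$ and $B_2 \in \mathcal{S}$ and $1 < t \le 1 + h$.
(iv) $n^2P(\mathbf{a}_n^{-1}Y_1 \in B_1,\ \mathbf{a}_n^{-1}Y_t \in B_2) \le C$ for all $n$ and $t > 1 + h$ where $C$ is a constant depending only on the sets $B_1$ and $B_2$ in $\mathcal{S}$.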


ProorF. (i) Setting x* = |b| A |eg| > O and y* = |b| A |c,| > 0, we have 
nP(a;'Y, € B) < nP(|Z,| > a,x*,|Z,Z,| > &,y*) 
< nP(|Z,| > a,M) 
+nP(|Z\| > a,x*,|Z,Zo| > G,y*,|Z,| < a,M). 
From (1.2) and (3.1) we have nP(|Z,| > a M) > M~* as n > œ, which can be 
made arbitrarily small by choosing M large. The second term is bounded by 


a, y* g a, y* 
nP||Z\| > a,x", |Z > —-7 | < nP(|Z,| > a,x*)P||Z,| > a. M 


> (x*) *-0=0 
since d,/a, > œ% by (3.5). 

(ii) Suppose 7 = 0. Then, with x* = |b| A |col, y* = min; <,<” (18) A le) > 
0 and using an elementary bound, we have 
InP(a;,'Y, € B) — nP(a,bo < Z, < anco)| < MAP(|Z,| > a,2*,|Z,Z,| > G,y*), 
which goes to zero as n ~ œ by the proof in (i). Moreover, it follows from (1.2) 
and (1.3) that nP(a,b) < Zi < a,C)) > ÀA( bo; Co]. The argument for the case 
J # 0 is handled in the same manner and is omitted. 

(iii) If either B, or B, satisfies C1, then we are done by (i). So suppose B, and 
B, satisfy C2 with B, N e, = (b™, c®] # 6, Ba N e, = (bP, cP] # p. Then if 
J+0and 7’ #0, 

(3.6) nP(a; Y, & B, a} 'Y, = B,) <s nP(|Z,Z, ,., > a,x*, A > a,y*), 
where x* = |b) A Je] and y* = bP] A |c?|. Now if f# 1+ 7 and t+j’# 
1 + J, then by independence 
nP(\Z,Z, ,., > a,x*, [ZZ > a,y*) 
= nP(|Z,Z, 4l > &,x*)P(\Z,Z,,,,,| > a,y*) 


-> 0. 
On the other hand, if t= 1+joré+ 7’ =1+,), then we have the bound 


nP(|Z,Z,| > &,x*,|Z,Z3| > @,y*) 
< nP(\Z,| > a,M) + nP(\Z,Z.| > &,x*,|Z.Z5| > a 9", |Z < a,M) 





a, 
< nP(|Z,| >a,M) + nP(\Z,Z.| > a,x" )P|iZa > a Ti 
n 


=- M" asn> oo, 




where we have used (3.5) in the second term. Since $M$ is arbitrary the left side of (3.6) must have a zero limit. The other cases $j = 0$ or $j' = 0$ are done in a similar way.

(iv) This follows easily from (i) and (ii) since for $t > 1 + h$ the vectors $Y_1$ and $Y_t$ are independent. □


PROPOSITION 3.2. Let $\{Z_t\}$ be iid satisfying (1.2) and (1.3) with $0 < \alpha < 2$ and suppose $E|Z_1|^\alpha = \infty$. If $a_n$ and $\tilde a_n$ are given by (3.1) and (3.4) we have

$I_n \Rightarrow I$

in the sense of convergence of point processes on the space $E$ (cf. Kallenberg, 1976).


PROOF. Since the point process $I$ is simple, it suffices to show by Theorem 4.7 in Kallenberg (1976) that

(3.7)  $EI_n(B) \to EI(B) < \infty$  for all $B \in \mathcal{S}$

and

(3.8)  $P(I_n(R) = 0) \to P(I(R) = 0)$

for all sets $R$ which are a finite union of disjoint sets in $\mathcal{S}$.

Clearly (3.7) is automatic from (i) and (ii) of Proposition 3.1 because $I$ has all of its points on the coordinate axes. Now suppose $R = \bigcup_{j=1}^{m} B_j$ is a union of disjoint sets in $\mathcal{S}$. For a fixed positive integer $k$, define $I^*_{[n/k]}(R) = \sum_{t=1}^{[n/k]} \varepsilon_{\mathbf{a}_n^{-1}Y_t}(R)$ where $[x]$ is the greatest integer $\le x$. Using a Bonferroni-type inequality, stationarity, and the disjointness of the sets $B_j$, we have

$\sum_{j=1}^{m} [n/k]P(\mathbf{a}_n^{-1}Y_1 \in B_j) - \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{t=2}^{[n/k]} [n/k]P(\mathbf{a}_n^{-1}Y_1 \in B_i,\ \mathbf{a}_n^{-1}Y_t \in B_j)$
$\qquad \le P(I^*_{[n/k]}(R) > 0) \le \sum_{j=1}^{m} [n/k]P(\mathbf{a}_n^{-1}Y_1 \in B_j).$

It follows from above that

$\sum_{j=1}^{m} [n/k]P(\mathbf{a}_n^{-1}Y_1 \in B_j) = EI^*_{[n/k]}(R) \to k^{-1}EI(R)$

as $n \to \infty$. Applying Proposition 3.1(iii) and (iv), we also have

$\limsup_{n\to\infty} \sum_{t=2}^{[n/k]} [n/k]P(\mathbf{a}_n^{-1}Y_1 \in B_i,\ \mathbf{a}_n^{-1}Y_t \in B_j) = o(1/k)$  as $k \to \infty$

for $i, j = 1, \dots, m$, so that

(3.9)  $1 - k^{-1}EI(R) \le \liminf_{n\to\infty} P(I^*_{[n/k]}(R) = 0) \le \limsup_{n\to\infty} P(I^*_{[n/k]}(R) = 0) \le 1 - k^{-1}EI(R) + o(1/k).$

Since the vector-valued process $Y_t$ is $h$-dependent, a standard argument (cf. Leadbetter, Lindgren, and Rootzén, 1983, Chapters 3 and 5) gives

(3.10)  $P^k(I^*_{[n/k]}(R) = 0) - P(I_n(R) = 0) \to 0$  as $n \to \infty$

for every positive integer $k$. Taking the $k$th power of (3.9) and using (3.10), we obtain

$(1 - k^{-1}EI(R))^k \le \liminf_{n\to\infty} P(I_n(R) = 0) \le \limsup_{n\to\infty} P(I_n(R) = 0) \le (1 - k^{-1}EI(R) + o(1/k))^k.$

Now letting $k \to \infty$, we have $P(I_n(R) = 0) \to e^{-EI(R)}$. But $I$ is a Poisson process so that $e^{-EI(R)} = P(I(R) = 0)$ which verifies (3.8) as desired. □


THEOREM 3.3. Let $\{Z_t\}$ be iid satisfying (1.2) and (1.3) with $0 < \alpha < 2$ and $E|Z_1|^\alpha = \infty$. Then, if $a_n$ and $\tilde a_n$ are given by (3.1) and (3.4),

$\Bigl(a_n^{-2}\sum_{t=1}^{n} Z_t^2,\ \tilde a_n^{-1}\sum_{t=1}^{n}(Z_tZ_{t+1} - \mu_n),\ \dots,\ \tilde a_n^{-1}\sum_{t=1}^{n}(Z_tZ_{t+h} - \mu_n)\Bigr) \Rightarrow (S_0, S_1, \dots, S_h),$

where $\mu_n = EZ_1Z_2 1_{[|Z_1Z_2| \le \tilde a_n]}$ and $S_0, S_1, \dots, S_h$ are independent stable random variables; $S_0$ is positive with index $\alpha/2$ and $S_1, S_2, \dots, S_h$ are identically distributed with index $\alpha$.


Proor. Adapting the argument used in Section 2 of Resnick (1986) and in 
Section 4 of Davis and Resnick (1985) (see also Resnick and Greenwood, 1978) it 
is easy to show, for any 0 < 6 < 1, 


n 


n 
a,’ y2 Zilqz,> a,j)’ a," » (ZZ dqz,z,..1> à] 
t=] 


t=] 
AA EE E P i<i<h 


= (Sè, S?,..., 82), 
where 
ene 
Sò 7 XP) le> 8) 
and 


SP = DIP lyse) > sA(ds) 
k=l ee?) Lond 


for $i = 1, 2, \dots, h$. Clearly, $S_0^\delta, S_1^\delta, \dots, S_h^\delta$ are independent since the points $\{j_k^{(0)}\}, \{j_k^{(1)}\}, \dots, \{j_k^{(h)}\}$ are independent. The Itô representation implies $S_i^\delta \Rightarrow S_i$ as $\delta \to 0$, $i = 0, 1, \dots, h$ (cf. Resnick, 1986) where the vector $(S_0, S_1, \dots, S_h)$ is as


described in the statement of the theorem. In view of Billingsley (1968, Theorem 
4.2), the proof is complete once we show 


(3.11) an imsupE| a, 2 Aiaia =0 
= t= 1 


n —> © 


and 


(3.12) lim lim sup var as Lae ditnict = 0, i ere T 

>00 nw t=1 
The expectation in (3.11) is equal to $na_n^{-2}EZ_1^2 1_{[|Z_1| \le a_n\delta]}$, which has the desired limit by Karamata's theorem (Feller, 1971, page 283). Since the process $\{Z_tZ_{t+i},\ t = 0, \pm 1, \pm 2, \dots\}$ is $i$-dependent, (3.12) holds by the comment on the top of page 266, Davis (1983). □


REMARKS. (1) If the distribution of $Z_1$ is symmetric then so is the distribution of $Z_1Z_2$, in which case $\mu_n = 0$.

(2) For $0 < \alpha < 1$, the theorem remains valid without centering the terms $Z_tZ_{t+i}$ by $\mu_n$.

(3) In the case $1 < \alpha < 2$, $EZ_1Z_2 = (EZ_1)^2$ exists and from Karamata's theorem, $n\tilde a_n^{-1}(E(Z_1Z_2) - \mu_n) = n\tilde a_n^{-1}E(Z_1Z_2 1_{[|Z_1Z_2| > \tilde a_n]}) \to \text{const}$. Thus, by the convergence of types result, Theorem 3.3 is also valid if $\mu_n$ is replaced by $\mu^2 = (EZ_1)^2$.


4. Sample correlation function of $\{X_t\}$. As before let $\{Z_t\}$ be iid satisfying (1.2) and (1.3) with $0 < \alpha < 2$, $E|Z_1|^\alpha = \infty$, and define

(4.1)  $X_t = \sum_{j=-\infty}^{\infty} c_j Z_{t-j},$

where

(4.2)  $\sum_{j=-\infty}^{\infty} |j|\,|c_j|^\delta < \infty$  with $\delta = 1$ if $\alpha > 1$, and $0 < \delta < \alpha$ if $\alpha \le 1$.

We shall first concentrate on the unadjusted sample correlation function defined by

(4.3)  $\hat\rho(h) = C(h)/C(0)$,  $h \ge 0$,

where

(4.4)  $C(h) = \sum_{t=1}^{n} X_tX_{t+h}.$

The sum in (4.4) is terminated at $n$ rather than $n - h$ for notational simplicity in the following arguments. All of the results in this section, however, remain valid if the upper limit is $n - h$. Put $\rho(h) = \sum_j c_jc_{j+h}/\sum_j c_j^2$, which in the case that $\operatorname{var}(Z_1) < \infty$ is equal to $\operatorname{corr}(X_t, X_{t+h})$. In Davis and Resnick (1985, Theorem 4.2) it was shown under condition (1.4) that $\hat\rho(h) \to_P \rho(h)$. Here, we consider the limit distribution of $\hat\rho(h)$, suitably normalized. We begin with the following proposition which is similar to Lemma 8.4.3 in Anderson (1971).
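
The two quantities just defined, the unadjusted sample correlation in (4.3) and (4.4) and the target $\rho(h)$ computed from the coefficients, can be sketched in Python as follows. The function names are hypothetical, the sum in $C(h)$ is stopped at $n - h$ (the observable variant the text says is equally valid), and a finite coefficient list indexed from 0 is assumed.

```python
def sample_acf(x, h):
    """Unadjusted sample correlation rho_hat(h) = C(h)/C(0), no mean correction,
    with the sum in C(h) stopped at n - h."""
    n = len(x)
    c_h = sum(x[t] * x[t + h] for t in range(n - h))
    c_0 = sum(v * v for v in x)
    return c_h / c_0

def rho_from_coeffs(c, h):
    """rho(h) = sum_j c_j c_{j+h} / sum_j c_j^2 for a finite coefficient list c."""
    num = sum(c[j] * c[j + h] for j in range(len(c) - h))
    den = sum(v * v for v in c)
    return num / den

if __name__ == "__main__":
    print(sample_acf([1.0, 2.0, 3.0, 4.0], 1))   # C(1) = 20, C(0) = 30
    print(rho_from_coeffs([1.0, 0.5], 1))        # theta / (1 + theta^2) with theta = 0.5
```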


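PROPOSITION 4.1. Assume (4.1), (4.2), and $E|Z_1|^\alpha = \infty$. Then for every positive integer $h$,

(4.5)  $\tilde a_n^{-1}a_n^2\Bigl\{\hat\rho(h) - \rho(h) - [C(0)]^{-1}\sum_{t=1}^{n}\sum_{i \ne j} c_i(c_{j+h} - c_j\rho(h))Z_{t-i}Z_{t-j}\Bigr\} \to_P 0,$

where $a_n$ and $\tilde a_n$ are given by (3.1) and (3.4), respectively.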


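PROOF. We have

$\hat\rho(h) - \rho(h) = [C(0)]^{-1}(C(h) - \rho(h)C(0))$
$\qquad = [C(0)]^{-1}\sum_{t=1}^{n}\Bigl[\sum_i\sum_j c_ic_jZ_{t-i}Z_{t+h-j} - \rho(h)\sum_i\sum_j c_ic_jZ_{t-i}Z_{t-j}\Bigr]$
$\qquad = [C(0)]^{-1}\sum_{t=1}^{n}\sum_i\sum_j c_i(c_{j+h} - c_j\rho(h))Z_{t-i}Z_{t-j},$

so that the difference in (4.5) is equal to

$\tilde a_n^{-1}a_n^2[C(0)]^{-1}\sum_{t=1}^{n}\sum_i (c_ic_{i+h} - c_i^2\rho(h))Z_{t-i}^2$
$\qquad = \tilde a_n^{-1}a_n^2[C(0)]^{-1}\sum_i (c_ic_{i+h} - c_i^2\rho(h))\sum_{t=1}^{n} Z_{t-i}^2$
$\qquad = \tilde a_n^{-1}a_n^2[C(0)]^{-1}\sum_i (c_ic_{i+h} - c_i^2\rho(h))\Bigl\{\sum_{t=1}^{n} Z_t^2 + U_{n,i}\Bigr\},$

where $U_{n,i} = \sum_{t=1}^{n} Z_{t-i}^2 - \sum_{t=1}^{n} Z_t^2$ is the sum of at most $2|i|$ random variables. Since $a_n^{-2}C(0)$ converges in distribution (Theorem 4.2 in Davis and Resnick, 1985) and $\sum_i (c_ic_{i+h} - c_i^2\rho(h)) = 0$ it suffices to show

(4.6)  $\limsup_{n\to\infty} E\Bigl|\sum_i (c_ic_{i+h} - c_i^2\rho(h))U_{n,i}\Bigr|^{\delta/2} < \infty,$

$\delta$ defined in (4.2). Because $\delta < \alpha$, $E|Z_1|^\delta < \infty$, so that by the triangle inequality and assumption (4.2), we have

$E\Bigl|\sum_i (c_ic_{i+h} - c_i^2\rho(h))U_{n,i}\Bigr|^{\delta/2} \le \sum_i (|c_ic_{i+h}|^{\delta/2} + |c_i|^\delta|\rho(h)|^{\delta/2})E|U_{n,i}|^{\delta/2}$
$\qquad \le \sum_i (|c_ic_{i+h}|^{\delta/2} + |c_i|^\delta|\rho(h)|^{\delta/2})\,2|i|\,E|Z_1|^\delta$

and by the Schwarz inequality this is bounded by

$2E|Z_1|^\delta\Bigl\{\Bigl(\sum_i |c_i|^\delta|i|\Bigr)^{1/2}\Bigl(\sum_i |c_{i+h}|^\delta|i|\Bigr)^{1/2} + |\rho(h)|^{\delta/2}\sum_i |c_i|^\delta|i|\Bigr\} < \infty$

by assumption (4.2). Thus (4.6) follows since the bound does not depend on $n$. □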


PROPOSITION 4.2. Assume (4.1), (4.2), and $E|Z_1|^\alpha = \infty$. Then

$a_n^{-2}\Bigl(C(0) - \sum_{t=1}^{n}\sum_i c_i^2Z_{t-i}^2\Bigr) = a_n^{-2}\sum_{t=1}^{n}\sum_{i \ne j} c_ic_jZ_{t-i}Z_{t-j} \to_P 0.$


PROOF. The proof of Proposition 2.1 can be adapted to this case but a simpler argument is given here instead. Choose $0 < \delta < \alpha$ satisfying (4.2) with $\alpha < 2\delta$. The triangle inequality gives

$E\Bigl|a_n^{-2}\sum_{t=1}^{n}\sum_{i \ne j} c_ic_jZ_{t-i}Z_{t-j}\Bigr|^\delta \le a_n^{-2\delta}\,n\sum_{i \ne j} |c_ic_j|^\delta E|Z_1Z_2|^\delta \le na_n^{-2\delta}\Bigl(\sum_i |c_i|^\delta\Bigr)^2(E|Z_1|^\delta)^2.$

Now since $a_n$ is regularly varying with index $1/\alpha$, $a_n^{2\delta}$ is regularly varying with index $2\delta/\alpha > 1$, and hence $na_n^{-2\delta} \to 0$. □


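Rearranging the terms in the sum (4.5), we have

(4.7)  $\sum_{t=1}^{n}\sum_{i \ne j} c_i(c_{j+h} - c_j\rho(h))Z_{t-i}Z_{t-j} = \sum_{t=1}^{n}\sum_i\sum_{j \ne 0} c_i(c_{i-j+h} - c_{i-j}\rho(h))Z_{t-i}Z_{t-i+j} = \sum_{j \ne 0}\sum_{t=1}^{n}\sum_i \psi_{i,j}Z_{t-i}Z_{t-i+j},$

where $\psi_{i,j} = c_i(c_{i-j+h} - c_{i-j}\rho(h))$, $i = 0, \pm 1, \pm 2, \dots$, $j = \pm 1, \pm 2, \dots$.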


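PROPOSITION 4.3. Assume (4.1), (4.2) and $E|Z_1|^\alpha = \infty$. As $n \to \infty$ we have

(i)  $\tilde a_n^{-1}\Bigl\{\sum_{t=1}^{n}\sum_i \psi_{i,j}Z_{t-i}Z_{t-i+j} + \sum_{t=1}^{n}\sum_i \psi_{i,-j}Z_{t-i}Z_{t-i-j} - \sum_i (\psi_{i,j} + \psi_{i,-j})\sum_{t=1}^{n} Z_tZ_{t+j}\Bigr\} \to_P 0$

for each $j > 0$ and

(ii)  $a_n^{-2}\Bigl\{\sum_{t=1}^{n}\sum_i c_i^2Z_{t-i}^2 - \sum_i c_i^2\sum_{t=1}^{n} Z_t^2\Bigr\} \to_P 0,$

and therefore $a_n^{-2}\bigl(C(0) - \sum_i c_i^2\sum_{t=1}^{n} Z_t^2\bigr) \to_P 0$.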


PRooF. (i) Interchanging the order of summation and regrouping terms, the 
difference in (i) becomes 


A—-! 


a,'D4.,| 2 Lily” E22, 
i fom | 


f—1-1 


n-in-]} 


+a Ew- 2 Zili 22; 


i=l i-; {=] 





= äp 2M, ick + a Wis 
where 
Vinci = Zz ZZ tty x Z Let 


and 


Wia = D Zli) ~ a ZZi 
f=] 


f=l-t-y 


However with 6 as chosen in (4.2) 


L,J 








8 
nt} < limsup Uy, ELV, a? 


n 


$2), lile IZ]? < o, 
I 


whence ã; 'È, y 
which proves (i). 

(ii) The above argument also works in this case but with ô replaced by 6/2. 
The last statement follows from Proposition 4.2. 0 


p 0. The same argument also gives G,'Y,w, _,W,., >p 0, 




THEOREM 4.4. Suppose $X_t = \sum_{j=-\infty}^{\infty} c_jZ_{t-j}$ where $\{c_j\}$ satisfies (4.2) and $\{Z_t\}$ satisfies (1.2) and (1.3), and $E|Z_1|^\alpha = \infty$, $0 < \alpha < 2$. If $a_n$ and $\tilde a_n$ are given by (3.1) and (3.4), then for any positive integer $l$,

(4.8)  $\bigl(\tilde a_n^{-1}a_n^2(\hat\rho(h) - \rho(h) - d_{h,n}/C(0)),\ 1 \le h \le l\bigr) \Rightarrow (Y_1, Y_2, \dots, Y_l)$

in $\mathbb{R}^l$, where

$d_{h,n} = n\sum_{j=1}^{\infty}(\rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h))\sum_i c_i^2\, EZ_1Z_2 1_{[|Z_1Z_2| \le \tilde a_n]},$

$Y_h = \sum_{j=1}^{\infty}(\rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h))S_j/S_0,$

and $S_0, S_1, S_2, \dots$ are independent stable random variables as described in Theorem 3.3 (i.e., $S_0$ is positive with index $\alpha/2$ and $S_1, S_2, \dots$ are identically distributed with index $\alpha$). In addition, if either

(i) $0 < \alpha < 1$, or
(ii) $\alpha = 1$ and the distribution of $Z_1$ is symmetric, or
(iii) $1 < \alpha < 2$ and $EZ_1 = 0$,

then (4.8) holds with $d_{h,n} = 0$, $h = 1, \dots, l$, and a location change in the $S_j$'s, $j \ge 1$.


Observe that since both $a_n$ and $\tilde a_n$ are regularly varying with index $1/\alpha$, the normalization $a_n^2/\tilde a_n$ is also regularly varying with index $1/\alpha$. That is, $a_n^2/\tilde a_n = n^{1/\alpha}L(n)$ for some slowly varying function $L$.


PROOF. From Proposition 4.3, Theorem 3.3, and the continuous mapping theorem, we have for any fixed positive integer $m$,

(4.9)  $\Bigl(a_n^{-2}C(0),\ \tilde a_n^{-1}\sum_{0 < |j| \le m}\sum_{t=1}^{n}\sum_i \psi_{i,j}(Z_{t-i}Z_{t-i+j} - \mu_n)\Bigr) \Rightarrow \Bigl(\sum_i c_i^2 S_0,\ \sum_{j=1}^{m}\sum_i (\psi_{i,j} + \psi_{i,-j})S_j\Bigr),$

where $\mu_n = EZ_1Z_2 1_{[|Z_1Z_2| \le \tilde a_n]}$. The dependence of $\psi_{i,j}$ on $h$ is temporarily suppressed. The plan of the proof is to first show that (4.9) remains valid with $m$ replaced by $\infty$ and then make use of Propositions 4.1-4.3 to derive the weak limit of $\tilde a_n^{-1}a_n^2(\hat\rho(h) - \rho(h) - d_{h,n}/C(0))$.
To establish the limit in (4.9) with $m$ replaced by $\infty$ it suffices to show (cf. Billingsley, 1968, Theorem 4.2) that


(4.10) lim limsupP| @ 


nt 


Ns £ Lv liali EnF 











n> x |> m iml : 
for every y > 0 and 
(4.11) Ded Vig SS LA y o 


j=l jel ı 


The limit in (4.11) can be checked using characteristic functions since 




rh Cp, , + p, -,)|* < œ. As for (4.10), we have the bound 


P a, 2 2 D AN = Hp) > r! 


>m tml ot 























n 
< P aw! > 2 V A AA Zi js 4, ] — Hn) > vA 
jj>m t=] 
| 2 2 Lib, Ji- E +L, ER a | ad 7% 
[yj>am tert 
= Á + B. 


Applying Chebyshev’s inequality to A gives, after some simplification (see the 
proof of Proposition 2.1), 


noon 
Á < 4y Ar 3 Ds L 2 » We Mra, yl + RTTE 


am] ¢=1 4 |j>m];]>m 


2 
ga WW peray yl T Preiera plo; ’ 


where o? = E|Z,Z,)71, ZZ, zà] & Change of variables in the summation gives the 
bound 


As 4y” “a, ĉn Y D D > We Mee yl 


fer oc rax | >m >m 


2 
+IP; yi T Wirral + TETE ae 9 Lune 


and since Lin, Weak, | = Ere -a [Yr | for all integers k, 
= 2 
As ayana 2 Ml O; 
i=- yx |j >m 


The absolute summability of the c,’s ensures that all 4 ne above sums involving 

Iy, | are finite and in particular BMS cis sane a Veg) m Ue Thus by 

Karamata’s theorem (7? no? > a/(2 — a)), we have lim lim sup, ., A = 0. 
With 6 as given in (4. A 


B<2y Grn D} Liv, J EIZ: Z’, Z> än] 


J| >m ıt 


m — x 


and again by Karamata’s theorem, nã; E|Z,2,) lyz 71> a} 7 #/(a — 8) so that 
lim lim sup, ~ B = 0, which establishes (4.9) with m replaced by oo. 


m S 


Now from Proposition 4.1 and (4.7), we have

$\tilde a_n^{-1}a_n^2(\hat\rho(h) - \rho(h)) = \tilde a_n^{-1}a_n^2(C(0))^{-1}\sum_{j \ne 0}\sum_{t=1}^{n}\sum_i \psi_{i,j}Z_{t-i}Z_{t-i+j} + o_p(1).$

Since

$\sum_i (\psi_{i,j} + \psi_{i,-j}) \Big/ \sum_i c_i^2 = \rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h),$

we then have

$\tilde a_n^{-1}a_n^2(\hat\rho(h) - \rho(h) - d_{h,n}/C(0)) = \tilde a_n^{-1}a_n^2(C(0))^{-1}\sum_{j \ne 0}\sum_{t=1}^{n}\sum_i \psi_{i,j}(Z_{t-i}Z_{t-i+j} - \mu_n) + o_p(1).$

It follows by applying the continuous mapping theorem to (4.9) that

$\tilde a_n^{-1}a_n^2(\hat\rho(h) - \rho(h) - d_{h,n}/C(0)) \Rightarrow \sum_{j=1}^{\infty}\sum_i (\psi_{i,j} + \psi_{i,-j})S_j \Big/ \Bigl(\sum_i c_i^2 S_0\Bigr) = Y_h.$

The proof of the joint convergence in (4.8) is essentially the same as the above argument. The only difference is that the vector in (4.9) is extended to an $(l + 1)$-dimensional vector where the $(h + 1)$th component is given by

$\tilde a_n^{-1}\sum_{0 < |j| \le m}\sum_{t=1}^{n}\sum_i \psi_{i,j}(Z_{t-i}Z_{t-i+j} - \mu_n)$,  $h = 1, 2, \dots, l$.

Finally, the last statement of the theorem is an immediate consequence of Remarks 1-3 in Section 3. □


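In the following two results, we consider the limit laws of the mean corrected version of the sample correlation function defined by

$\tilde\rho(h) = \sum_{t=1}^{n}(X_t - \bar X)(X_{t+h} - \bar X) \Big/ \sum_{t=1}^{n}(X_t - \bar X)^2,$

where $\bar X = \sum_{t=1}^{n} X_t/n$.


COROLLARY 1. Suppose $1 < \alpha < 2$. Then for any positive integer $l$,

$\bigl(\tilde a_n^{-1}a_n^2(\tilde\rho(h) - \rho(h)),\ 1 \le h \le l\bigr) \Rightarrow (Y_1, Y_2, \dots, Y_l).$


PROOF. Since the function $\tilde\rho(h)$ is location invariant, we may assume without loss of generality that $EZ_1 = 0$ (otherwise consider the process $X_t - EX_t = \sum_{j=-\infty}^{\infty} c_j(Z_{t-j} - EZ_{t-j})$). In view of Theorem 4.4 it suffices to show $\tilde\rho(h) - \hat\rho(h) = o_p(\tilde a_na_n^{-2})$. Using the identity $\sum_{t=1}^{n} X_t^2 - \sum_{t=1}^{n}(X_t - \bar X)^2 = n\bar X^2$, we have

(4.12)  $\tilde\rho(h) - \hat\rho(h) = \Bigl(\hat\rho(h)\,n\bar X^2 - \bar X\sum_{t=1}^{n} X_{t+h}\Bigr) \Big/ \sum_{t=1}^{n}(X_t - \bar X)^2.$

In Section 4 of Davis and Resnick (1985), it was shown that $\bigl(\sum_{t=1}^{n}(X_t - \bar X)^2\bigr)^{-1} = O_p(a_n^{-2})$, $\sum_{t=1}^{n} X_t = O_p(a_n) = o_p(\tilde a_n)$, and $\hat\rho(h) \to_P \rho(h)$. Since $\bar X \to EX_1 = 0$ and $\sum_{t=1}^{n} X_{t+h}/n \to EX_1 = 0$ a.s. by the ergodic theorem, this implies $\tilde\rho(h) - \hat\rho(h) = o_p(\tilde a_na_n^{-2})$ as desired. □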


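In the $0 < \alpha < 1$ case, the sample mean plays a dominant role in determining the limit distribution of $\tilde\rho(h)$. In order to describe this result, it is necessary to first define two random variables. Let $\{j_k: k = 1, 2, \dots\}$ be the points of a Poisson process on $\mathbb{R} \setminus \{0\}$ with intensity $\lambda(dx) = \alpha p\, x^{-\alpha-1} 1_{(0,\infty)}(x)\, dx + \alpha q\, (-x)^{-\alpha-1} 1_{(-\infty,0)}(x)\, dx$, where $p$ and $q$ are given in (1.2). Now if $0 < \alpha < 1$, then $\sum_{k=1}^{\infty} |j_k| < \infty$ a.s. so that the random variables $S = \sum_{k=1}^{\infty} j_k$ and $S_0 = \sum_{k=1}^{\infty} j_k^2$ are well-defined. In particular, $S$ and $S_0$ each have a stable distribution with index $\alpha$ and $\alpha/2$, respectively.


COROLLARY 2. Suppose $0 < \alpha < 1$. Then for any positive integer $l$

$\bigl(n(\tilde\rho(h) - \rho(h)),\ 1 \le h \le l\bigr) \Rightarrow \bigl((\rho(h) - 1),\ 1 \le h \le l\bigr)\Bigl(\sum_j c_j\Bigr)^2 S^2 \Big/ \Bigl(\Bigl(\sum_j c_j^2\Bigr) S_0\Bigr).$


REMARK. Some properties of the distribution function of $S^2/S_0$ are studied in Logan et al. (1973). See also Cline (1983).


PROOF. Let $\{j_k\}$ be the points of a Poisson process as described above. Using an argument similar to that given in Section 4 of Davis and Resnick (1985) (see also Resnick, 1986, Section 4) it is easy to show

(4.13)  $\Bigl(a_n^{-1}\sum_{t=1}^{n} X_t,\ a_n^{-2}\sum_{t=1}^{n}(X_t - \bar X)^2\Bigr) \Rightarrow \Bigl(\Bigl(\sum_j c_j\Bigr)\sum_{k=1}^{\infty} j_k,\ \Bigl(\sum_j c_j^2\Bigr)\sum_{k=1}^{\infty} j_k^2\Bigr) = \Bigl(\Bigl(\sum_j c_j\Bigr)S,\ \Bigl(\sum_j c_j^2\Bigr)S_0\Bigr).$

Now rearranging the identity in (4.12), we have

(4.14)  $n(\tilde\rho(h) - \rho(h)) = n(\hat\rho(h) - \rho(h)) + \dfrac{n(\hat\rho(h) - 1)\,n\bar X^2}{\sum_{t=1}^{n}(X_t - \bar X)^2} + \dfrac{n\bar X\sum_{t=1}^{n}(X_t - X_{t+h})}{\sum_{t=1}^{n}(X_t - \bar X)^2}.$

By Theorem 4.4 the first term is $O_p(\tilde a_na_n^{-2}n) = o_p(1)$ since $\alpha < 1$. The third term in (4.14) is also negligible because $n\bar X = O_p(a_n)$, $\bigl(\sum_{t=1}^{n}(X_t - \bar X)^2\bigr)^{-1} = O_p(a_n^{-2})$, and $\sum_{t=1}^{n}(X_t - X_{t+h}) = O_p(1)$ so that the product of the three terms is $O_p(a_n^{-1}) = o_p(1)$. As for the middle term,

$\dfrac{n(\hat\rho(h) - 1)\,n\bar X^2}{\sum_{t=1}^{n}(X_t - \bar X)^2} \Rightarrow \dfrac{(\rho(h) - 1)\bigl(\sum_j c_j S\bigr)^2}{\bigl(\sum_j c_j^2\bigr)S_0}$

follows from (4.13) and the weak consistency of $\hat\rho(h)$. Finally the joint convergence in the statement of the corollary is clear. □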


We close this section with a comparison of the standard result for the correlation function in the finite variance case and Theorem 4.4. Assuming that $Z_t$ has a finite variance and a zero mean, Theorem 8.4.6 of Anderson (1971) gives

$n^{1/2}\bigl(\hat\rho(1) - \rho(1),\ \hat\rho(2) - \rho(2),\ \dots,\ \hat\rho(l) - \rho(l)\bigr) \Rightarrow (V_1, V_2, \dots, V_l),$

where the limit vector has a multivariate normal distribution with mean zero and covariance matrix given by Bartlett's formula

$w_{gh} = \sum_{j=-\infty}^{\infty}\bigl(\rho(j+g)\rho(j+h) + \rho(j-g)\rho(j+h) - 2\rho(j)\rho(g)\rho(h+j) - 2\rho(j)\rho(h)\rho(g+j) + 2\rho^2(j)\rho(g)\rho(h)\bigr).$

However, by checking covariances the components in the limit vector may be written as

(4.15)  $V_h = \sum_{j=1}^{\infty}(\rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h))S_j$,  $h = 1, 2, \dots, l$,

where $\{S_j\}$ is a sequence of iid $N(0, 1)$ random variables. This corresponds to the numerator portion of the limit in Theorem 4.4 with $\alpha = 2$. In fact, $S_j$ may be identified as the weak limit of $\sigma^{-2}n^{-1/2}\sum_{t=1}^{n} Z_tZ_{t+j}$, $j = 1, 2, \dots$. Moreover, in the finite variance case, the sample variance

$n^{-1}\sum_{t=1}^{n} X_t^2 \to_P \sum_{j=-\infty}^{\infty} c_j^2\operatorname{var}(Z_1) > 0,$

whereas $a_n^{-2}\sum_{t=1}^{n} X_t^2 \Rightarrow \sum_{j=-\infty}^{\infty} c_j^2 S_0$ in the $0 < \alpha < 2$ case. This phenomenon accounts for the division by $S_0$ in Theorem 4.4 and not in (4.15).
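
The claim that the covariances of the $V_h$ in (4.15) reproduce Bartlett's formula can be checked numerically. The Python sketch below does so for an MA(1) and an AR(1) autocorrelation function, using Bartlett's formula in its standard form and truncating both infinite sums at a hypothetical cutoff `K`. All function names are assumptions made for illustration.

```python
def bartlett(rho, g, h, K):
    """Bartlett's formula w_{gh}, with the sum over j truncated to |j| <= K."""
    total = 0.0
    for j in range(-K, K + 1):
        total += (rho(j + g) * rho(j + h) + rho(j - g) * rho(j + h)
                  - 2 * rho(j) * rho(g) * rho(h + j)
                  - 2 * rho(j) * rho(h) * rho(g + j)
                  + 2 * rho(j) ** 2 * rho(g) * rho(h))
    return total

def cov_from_415(rho, g, h, K):
    """Cov(V_g, V_h) implied by (4.15) with iid N(0,1) S_j, truncated at j <= K."""
    def a(lag, j):
        return rho(lag + j) + rho(lag - j) - 2 * rho(j) * rho(lag)
    return sum(a(g, j) * a(h, j) for j in range(1, K + 1))

def rho_ma1(theta):
    """Autocorrelation function of X_t = Z_t + theta * Z_{t-1}."""
    r1 = theta / (1 + theta ** 2)
    return lambda k: 1.0 if k == 0 else (r1 if abs(k) == 1 else 0.0)

def rho_ar1(phi):
    """Autocorrelation function of X_t = phi * X_{t-1} + Z_t."""
    return lambda k: phi ** abs(k)

if __name__ == "__main__":
    for rho in (rho_ma1(0.5), rho_ar1(0.5)):
        for g, h in ((1, 1), (1, 2), (2, 3)):
            print(g, h, bartlett(rho, g, h, 200), cov_from_415(rho, g, h, 200))
```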


5. Examples. In this section, we consider applications of Theorem 4.4 to some time series models. Throughout this section, assume the hypotheses of Theorem 4.4 are met and, for simplicity, suppose the distribution of $Z_1$ is symmetric and that the distribution of $|Z_1|$ is asymptotically equivalent to a Pareto. It then follows that

(5.1)  $(n/\log n)^{1/\alpha}(\hat\rho(h) - \rho(h)) \Rightarrow \sum_{j=1}^{\infty}(\rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h))S_j/S_0,$

and $S_1, S_2, \dots$ is now an iid sequence of symmetric $\alpha$-stable random variables, independent of the positive $\alpha/2$-stable random variable $S_0$.

The numerator of the limit in (5.1) is also a symmetric $\alpha$-stable random variable with characteristic function given by


(5.2)  $\exp\Bigl\{-|t|^\alpha\sum_{j=1}^{\infty}|\rho(h+j) + \rho(j-h) - 2\rho(j)\rho(h)|^\alpha\Bigr\}.$


Extending the notion of variance for a Gaussian random variable, Stuck (1978) defined the dispersion of a random variable with characteristic function (5.2) by

(5.3)  $\text{disp} = \sum_{j=1}^{\infty}|\rho(h+j) + \rho(j-h) - 2\rho(j)\rho(h)|^\alpha.$

(See also Cline, 1983.) The limit in (5.1) is then equal in distribution to $(\text{disp})^{1/\alpha}S_1/S_0$. Notice that upon setting $\alpha = 2$ in (5.3), we get the asymptotic variance of $\hat\rho(h)$ in the traditional finite second moment setting.


5.1. MA(q). Suppose $\{X_t\}$ is the finite moving average

$X_t = Z_t + \theta_1Z_{t-1} + \cdots + \theta_qZ_{t-q}.$

Then, since $\rho(h) = 0$ for $|h| > q$, we have for $h > q$

$(n/\log n)^{1/\alpha}(\hat\rho(h) - \rho(h)) \Rightarrow \Bigl(1 + 2\sum_{j=1}^{q}|\rho(j)|^\alpha\Bigr)^{1/\alpha}S_1/S_0.$
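
The dispersion (5.3) is straightforward to evaluate numerically, and doing so confirms the MA(q) simplification for lags $h > q$. The following Python sketch truncates the sum at a hypothetical cutoff `K`; the helper `rho_finite` and the sample autocorrelation values are assumptions chosen only for illustration.

```python
def dispersion(rho, h, alpha, K):
    """disp in (5.3), truncating the sum over j at K terms."""
    return sum(abs(rho(h + j) + rho(j - h) - 2 * rho(j) * rho(h)) ** alpha
               for j in range(1, K + 1))

def rho_finite(values):
    """Autocorrelation function with rho(0) = 1, rho(k) = values[|k| - 1]
    for 1 <= |k| <= q = len(values), and rho(k) = 0 otherwise."""
    def rho(k):
        k = abs(k)
        if k == 0:
            return 1.0
        return values[k - 1] if k <= len(values) else 0.0
    return rho

if __name__ == "__main__":
    rho = rho_finite([0.5, 0.2])          # an MA(2)-type correlation pattern, q = 2
    alpha, h = 1.5, 3                     # any lag h > q
    closed_form = 1 + 2 * (0.5 ** alpha + 0.2 ** alpha)
    print(dispersion(rho, h, alpha, 50), closed_form)
```

With $\alpha = 2$ and $h = 1$ the same function returns the classical Bartlett variance of $\hat\rho(1)$.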

5.2. Estimation of $\theta$ in a MA(1). For the MA(1) process $X_t = Z_t + \theta Z_{t-1}$, $\rho(1) = \theta/(1 + \theta^2)$. A method-of-moments-type estimator for $\theta$ is found by solving the latter equation for $\theta$. Choosing the solution with the constraint $|\theta| < 1$ (cf. Fuller, 1976) gives

$\hat\theta = (1 - (1 - 4\hat\rho^2)^{1/2})/(2\hat\rho)$ if $0 < |\hat\rho| \le 0.5$,
$\hat\theta = -1$ if $\hat\rho < -0.5$,
$\hat\theta = 1$ if $\hat\rho > 0.5$,

where $\hat\rho = \hat\rho(1)$. Letting $g(\rho)$ denote the inverse of the function $\theta/(1 + \theta^2)$ with $|\theta| < 1$, we have by the mean value theorem

$\hat\theta - \theta = g(\hat\rho) - g(\rho) = g'(\rho)(\hat\rho - \rho) + o_p(\hat\rho - \rho).$

Hence

$(n/\log n)^{1/\alpha}(\hat\theta - \theta) \Rightarrow (1 - \theta^2)^{-1}(1 + \theta^2)^2\bigl((1 - 2\rho^2(1))^\alpha + |\rho(1)|^\alpha\bigr)^{1/\alpha}S_1/S_0.$

The dispersion of the numerator of the limit simplifies to

$\dfrac{(1 + \theta^4)^\alpha + |\theta|^\alpha(1 + \theta^2)^\alpha}{(1 - \theta^2)^\alpha}.$

By setting $\alpha = 2$, we obtain the asymptotic variance of $\hat\theta$ (cf. Fuller, 1976, page 343). Note that while this estimate of $\theta$ is inefficient in the finite variance case, its performance in the $0 < \alpha < 2$ case seems to be good. For example, in a simulation experiment, 100 replications of the process $Z_t - 0.4Z_{t-1}$, $t = 1, 2, \dots, 100$, were generated where $Z_t$ is Cauchy distributed. The mean of the $\hat\theta$'s was $-0.40074$ with a standard deviation of 0.0790. This compares favorably with the asymptotic standard deviation of $((1 - \theta^2)/n)^{1/2} = 0.0917$ for the maximum likelihood estimator of $\theta$ in a MA(1) model assuming the noise sequence is normally distributed. While comparing variances in this situation may be a bit misleading, it nevertheless gives an indication of the reasonably good performance of $\hat\theta$ in the $0 < \alpha < 2$ case. For some comparisons in AR($p$) models see Bloomfield and Steiger (1983).
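
The moment estimator and a simulation of the kind described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the value $\hat\theta = 0$ at $\hat\rho = 0$ is filled in by continuity, $C(1)$ is summed to $n - 1$, and standard Cauchy noise is generated by the inverse-CDF transform.

```python
import math
import random

def theta_hat(rho1):
    """Moment estimator of theta from rho_hat(1), taking the root with |theta| <= 1.
    The value 0 at rho1 == 0 is filled in here by continuity (an assumption)."""
    if rho1 > 0.5:
        return 1.0
    if rho1 < -0.5:
        return -1.0
    if rho1 == 0.0:
        return 0.0
    return (1.0 - math.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)

def simulate_theta_hat(theta, n, reps, seed):
    """theta_hat over `reps` MA(1) paths of length n driven by standard Cauchy noise."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # standard Cauchy draws via tan(pi * (U - 1/2))
        z = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n + 1)]
        x = [z[t + 1] + theta * z[t] for t in range(n)]
        c0 = sum(v * v for v in x)
        c1 = sum(x[t] * x[t + 1] for t in range(n - 1))
        estimates.append(theta_hat(c1 / c0))
    return estimates

if __name__ == "__main__":
    est = simulate_theta_hat(theta=-0.4, n=100, reps=100, seed=0)
    mean = sum(est) / len(est)
    sd = math.sqrt(sum((e - mean) ** 2 for e in est) / (len(est) - 1))
    print(mean, sd)
```

The printed summary depends on the seed and the generator, so it should not be expected to match the figures quoted in the text exactly.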


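5.3. AR(1). Let $\{X_t\}$ be the AR(1) process $X_t = \phi X_{t-1} + Z_t$ where $|\phi| < 1$. In this case, $\rho(h) = \phi^{|h|}$ and estimating $\phi$ by $\hat\phi = \hat\rho(1)$, we have

$(n/\log n)^{1/\alpha}(\hat\phi - \phi) \Rightarrow \Bigl(\sum_{j=1}^{\infty}|\phi^{j+1} + \phi^{j-1} - 2\phi^{j+1}|^\alpha\Bigr)^{1/\alpha}S_1/S_0 = \dfrac{1 - \phi^2}{(1 - |\phi|^\alpha)^{1/\alpha}}\,S_1/S_0.$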
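
The closed form of the AR(1) limit can be checked directly from (5.3). The following worked steps are a sketch that uses only $\rho(h) = \phi^{|h|}$ and $h = 1$.

```latex
\begin{aligned}
\rho(1+j) + \rho(j-1) - 2\rho(j)\rho(1)
  &= \phi^{j+1} + \phi^{j-1} - 2\phi^{j+1}
   = \phi^{j-1}(1-\phi^{2}), \qquad j \ge 1,\\
\text{disp}
  &= \sum_{j=1}^{\infty} \bigl|\phi^{j-1}(1-\phi^{2})\bigr|^{\alpha}
   = \frac{(1-\phi^{2})^{\alpha}}{1-|\phi|^{\alpha}},\\
(\text{disp})^{1/\alpha}
  &= \frac{1-\phi^{2}}{(1-|\phi|^{\alpha})^{1/\alpha}} .
\end{aligned}
```

For $\alpha = 2$ this gives $\text{disp} = 1 - \phi^2$, the familiar asymptotic variance of $\hat\rho(1)$ for an AR(1) process with finite variance.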


5.4. Yule-Walker estimates. The Yule-Walker matrix equation for the AR($p$) model $X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + Z_t$, assuming $1 - \phi_1z - \phi_2z^2 - \cdots - \phi_pz^p \ne 0$, $|z| \le 1$, is

(5.4)  $R\phi = \rho,$

where $R$ is the $p \times p$ matrix $[\rho(i - j)]_{i,j=1}^{p}$, $\phi = (\phi_1, \dots, \phi_p)'$ and $\rho = (\rho(1), \dots, \rho(p))'$. The Yule-Walker estimate of $\phi$ is then defined as the solution of (5.4) with $R$ and $\rho$ replaced by $\hat R = [\hat\rho(i - j)]_{i,j=1}^{p}$ and $\hat\rho = (\hat\rho(1), \dots, \hat\rho(p))'$, respectively. As in Yohai and Maronna (1977), for $z \in \mathbb{R}^p$ define $\phi(z) = R(z)^{-1}z$, where $R(z) = [z_{|i-j|}]_{i,j=1}^{p}$ and $z_0 = 1$. Since $\hat R \to_P R$ and $R$ is nonsingular, this implies $\phi(\hat\rho)$ is well defined for large $n$. The mean value theorem then gives

$\hat\phi - \phi = D(\hat\rho - \rho) + o_p(\hat\rho - \rho),$

where $D$ is the $p \times p$ matrix of partial derivatives of $\phi(\cdot)$ evaluated at $\rho$. Consequently,

$(n/\log n)^{1/\alpha}(\hat\phi - \phi) \Rightarrow DY,$

where $Y = (Y_1, Y_2, \dots, Y_p)'$ with $Y_h = \sum_{j=1}^{\infty}(\rho(h+j) + \rho(h-j) - 2\rho(j)\rho(h))S_j/S_0$, $h = 1, \dots, p$.
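
The Yule-Walker estimate is the solution of a small linear system, which the following Python sketch solves by Gaussian elimination. The function names are hypothetical, and the input is assumed to be the list of sample autocorrelations at lags 1 through $p$ with a nonsingular estimated matrix.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def yule_walker(rho_hat):
    """Solve (5.4) with R and rho replaced by estimates.
    rho_hat = [rho_hat(1), ..., rho_hat(p)]; returns [phi_hat_1, ..., phi_hat_p]."""
    p = len(rho_hat)
    r = [1.0] + list(rho_hat)                          # r[k] = rho_hat(k), rho_hat(0) = 1
    big_r = [[r[abs(i - j)] for j in range(p)] for i in range(p)]
    return solve_linear(big_r, list(rho_hat))

if __name__ == "__main__":
    # Exact AR(1) correlations with phi = 0.5, fitted as an AR(2)
    print(yule_walker([0.5, 0.25]))
```

Feeding exact AR(1) autocorrelations into an AR(2) fit should return the AR(1) coefficient followed by zero, which is a quick sanity check on the solver.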


REFERENCES 


ANDERSON, T. W. (1971). The Statistical Analysis of Tune Seres. Wiley, New York. 

BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York. 

BLOOMFIELD, P. and STEIGER, W. (1983). Least Absolute Deviations Theory Applications and 
Algorithms. Birkhauser, Boston. 

CLINE, I). (1983). Estimation and Linear Prediction for Regression, Autoregression and ARMA with 
Infinite Variance Data. Ph.D. thesis, Department of Statistica, Colorado State University. 

Davis, R. A. (1983). Stable lumits for partial sums of dependent random variables. Ann. Probab. 11 
262-269. 

Davis, R. A. and RESNICK, S. (1985). Lamit theory for moving averages of random variables with 
regularly varying tail probabilities. Ann. Probab. 13 179-196. 

DuUuMOUCHEL, W. (1983). Estimating the stable index a in order to measure tail thickness. Ann. 
Statist. 11 1019-1036. 

EVANS, J. ©. (1969). Preliminary analysis of ELP noise. Technical Note 1969-18, DDC AD691814, 
MI'T Lincoln Laboratory. 

FAMA, E. (1965) The behavior of stock market prices. J. Bustneas 38 34-105. 

FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2, 2nd ed. Wiley, 
New York. 

FULLER, W. (1976). Introduction to Statistical Time Serves. Wiley, New York. 


558 R DAVIS AND S. RESNICK 


HAAN, L. DE (1970). On Regular Variation and Its Application to the Weak Convergence of Sample
Extremes. Tract 32, Mathematics Centre, Amsterdam.

HALL, P. (1982). On some simple estimates of an exponent of regular variation. J. Roy. Statist. Soc.
Ser. B 44 37-42.

HANNAN, E. J. and KANTER, M. (1977). Autoregressive processes with infinite variance. J. Appl.
Probab. 14 411-415.

KALLENBERG, O. (1976). Random Measures. Akademie, Berlin.

KANTER, M. and STEIGER, W. L. (1974). Regression and autoregression with infinite variance. Adv.
in Appl. Probab. 6 768-783.

LEADBETTER, M. R., LINDGREN, G. and ROOTZÉN, H. (1983). Extremes and Related Properties of
Random Sequences and Processes. Springer, Berlin.

LOGAN, B., MALLOWS, C., RICE, S. and SHEPP, L. (1973). Limit distributions of self-normalized sums.
Ann. Probab. 1 788-809.

MERTZ, P. (1965). Impulse noise and error performance in data transmission. Memo RM-4526-PR,
DDC AD614416, Rand.

REEVES, P. (1969). A non-Gaussian turbulence simulation. Technical Report AFFDL-TR 69-67, Air
Force Flight Dynamics Laboratory.

RESNICK, S. (1986). Point processes, regular variation and weak convergence. To appear in Adv. in
Appl. Probab.

RESNICK, S. and GREENWOOD, P. (1978). A bivariate stable characterization and domains of
attraction. J. Multivariate Anal. 9 206-221.

RYBIN, A. K. (1978). Effectiveness and stability of estimates of the parameters of a strong signal in
the case of non-Gaussian noises. Eng. Cybernetics 16 115-129.

SAFIULLIN, N. Z. and CHABDOROV, S. M. (1978). Transformation of non-Gaussian random processes
by radio devices. Telecommunications and Radio Engineering 32 114-116.

SENETA, E. (1976). Regularly Varying Functions. Lecture Notes in Mathematics 508. Springer,
Berlin.

STUCK, B. W. (1978). Minimum error dispersion linear filtering of scalar symmetric stable processes.
IEEE Trans. Automat. Control AC-23 507-509.

STUCK, B. W. and KLEINER, B. (1974). A statistical analysis of telephone noise. Bell System Tech. J.
53 1263-1320.

YOHAI, V. and MARONNA, R. (1977). Asymptotic behavior of least-squares estimates for autoregressive
processes with infinite variances. Ann. Statist. 5 554-560.


DEPARTMENT OF STATISTICS 
COLORADO STATE UNIVERSITY 
FORT COLLINS, COLORADO 80523


The Annals of Statistics
1986, Vol. 14, No. 2, 559-578


STATISTICAL ESTIMATION OF THE PARAMETERS OF A 
MOVING SOURCE FROM ARRAY DATA¹


By SHEAN-TSONG CHIU
Rice University 


This paper is concerned with the problem of estimating the variable time
delays of a signal arriving at an array of sensors. A procedure to estimate the
parameters of a linear time delay model is proposed. The procedure compares
the Fourier transforms at different frequencies (thereby taking the Doppler 
effect into consideration). Under regularity conditions, the estimate obtained 
is shown to be consistent and asymptotically normal. Simulations were 
carried out and the results were found to agree well with the theoretical 
results. The procedure was applied to the records of the Imperial Valley 
earthquake of October 15, 1979, as recorded by the El Centro differential 
array. 


1. Introduction. The situation we are interested in is that of a signal 
emitted by a moving source, such as the leading point of an earthquake rupture, 
as received by an array of sensors. The particular data we are working with come 
from sensors located near the source of a major earthquake with a lengthy source 
rupture. Ground motion obtained by such sensors is referred to as strong motion 
records. 

Strong-motion seismology is a relatively new discipline. It is however an active 
one because of its importance in understanding the source properties, the wave 
properties in near field, and the effect of strong-motion on engineering structures 
[see, for example, Bolt (1981)]. 

Most earthquakes appear to be caused by faulting. The associated theory, the 
elastic-rebound theory, of earthquakes was first outlined by Reid (1910) in his 
study of the great San Francisco earthquake of 1906. According to the elastic- 
rebound theory [see Boore (1977)], rocks are elastic, and mechanical energy can 
be stored in them as in a compressed spring. When the two blocks forming the 
opposite sides of the fault move by a small amount, the motion elastically strains 
the rocks near the fault. When the stress becomes larger than the frictional 
strength of the fault, the frictional bond fails at its weakest point. (The point of 
initial rupture is called the hypocenter or focus.) From the hypocenter, the 
rupture rapidly propagates along the surface of the fault, causing the rocks on 
opposite sides of the fault to slip past each other. A portion of the elastic strain 
the rocks had stored before the rupture is suddenly released. The rocks along the 


Received December 1984; revised July 1985. 

¹This research was supported in part by National Science Foundation Grants MCS 80-02698 and
CEE 79-01642.

AMS 1980 subject classifications. Primary 62M99; secondary 60G35.

Key words and phrases. Time delay, Fourier transform, periodogram, spectrum, coherence,
Doppler effect.




560 S.-T. CHIU 


fault rebound to an equilibrium position in a matter of seconds. The elastic 
energy stored in the rocks is released as heat (generated by friction) and as 
seismic waves. The seismic waves are radiated by a moving source, so the waves 
may be expected to show Doppler-like effects. 

The faults of natural earthquakes appear to rupture in quite complex ways, 
and a reasonable representation might be an erratic motion superimposed on a 
generally smooth slip [Aki (1967)]. The ground motion produced by this kind of 
rupture will tend to look like a stochastic process. 

Available information for studying source rupture processes includes records
from seismometers located either close to or far away from the source. In this
paper, we study the records from an array of sensors which are close to an 
earthquake source. In general, the model of interest can be described as 


$$X_j(t) = S_0(A(t, r_j)) + \varepsilon(t, r_j). \tag{1.1}$$


Here $X_j(t)$ is the observation recorded by the sensor located at $r_j$, $S_0(t)$ is the
signal emitted by the source, and $A(t, r_j)$ is the time at which the signal arriving at
time $t$ at location $r_j$ was emitted by the source. Thus, each sensor receives the
common signal, but with a different time shift, together with the noise, $\varepsilon(t, r_j)$, at
that station.

In the following, we consider the case of two stations. We have the model

$$\begin{aligned}
X_1(t) &= S(t) + \varepsilon_1(t),\\
X_2(t) &= S(A(t)) + \varepsilon_2(t).
\end{aligned} \tag{1.2}$$


2. Estimation of the time delay for a fixed source. When the source is 
fixed, the time delay between the stations is constant and the model can be 
simplified to 


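$$\begin{aligned}
X_1(t) &= S(t) + \varepsilon_1(t),\\
X_2(t) &= S(t + \tau_0) + \varepsilon_2(t).
\end{aligned} \tag{2.1}$$

In this case the Fourier transforms of the signals have the relationship

$$d_S^T(\tau_0, \lambda) \approx d_S^T(\lambda)\exp(i\tau_0\lambda), \tag{2.2}$$

where

$$d_S^T(\lambda) = \sum_{t=0}^{T-1} S(t)\exp(-i\lambda t) \tag{2.3}$$

is the Fourier transform of the signal $S(t)$ at frequency $\lambda$, and

$$d_S^T(\tau, \lambda) = \sum_{t=0}^{T-1} S(t + \tau)\exp(-i\lambda t) \tag{2.4}$$

is the Fourier transform of the delayed signal, $S(t + \tau)$. With similar definitions
for $d_{\varepsilon_1}^T(\lambda)$ and $d_{\varepsilon_2}^T(\lambda)$, the Fourier transforms of the observed series have the
approximate relation

$$d_{X_2}^T(\lambda) \approx d_{X_1}^T(\lambda)\exp(i\lambda\tau_0) + d_{\varepsilon_2}^T(\lambda) - d_{\varepsilon_1}^T(\lambda)\exp(i\lambda\tau_0). \tag{2.5}$$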


TIME DELAY ESTIMATION 561


This suggests estimating $\tau_0$ by that $\tau$ which maximizes the criterion

$$Q_T(\tau) = \sum_{|s| < T/2} \psi(\omega_s)\, d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s)\exp(-i\tau\omega_s). \tag{2.6}$$

Here $\omega_s = 2\pi s/T$, $s = 0, \pm 1, \ldots$, are the Fourier frequencies. Since the spectra
of the signal and of the noise are usually not constant, the function $\psi$ was
introduced to put different weights on different frequencies. Hannan (1975)
proved that, under regularity conditions, such an estimate is strongly consistent 
and asymptotically normal. Thomson (1982) extended the results to vector 
observations. The case of dispersive waves (different frequency components 
traveling at different speeds) was studied in Hannan (1975). We remark that the 
problem of estimating the constant time delay has also been considered exten- 
sively in the study of passive sonar signal processing [see Carter (1981) and 
references therein]. 
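
The fixed-source criterion can be illustrated with a short numerical sketch. The code below is only a sketch under stated assumptions, not the procedure of the cited authors: it takes the criterion to be the weighted sum of $d_{X_2}^T(\omega_s)\,d_{X_1}^T(-\omega_s)\exp(-i\tau\omega_s)$ over the Fourier frequencies with $|s| < T/2$, keeps the real part, and maximizes it by a plain grid search over candidate delays. The function name `estimate_constant_delay` and the grid `taus` are hypothetical.

```python
import numpy as np

def estimate_constant_delay(x1, x2, taus, weight=None):
    """Grid-search the delay tau that maximizes the real part of
    sum_s weight(w_s) * d_X2(w_s) * d_X1(-w_s) * exp(-i * tau * w_s)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    T = len(x1)
    d1 = np.fft.fft(x1)  # d_X1(w_s) = sum_t x1[t] exp(-i w_s t)
    d2 = np.fft.fft(x2)
    omega = 2 * np.pi * np.fft.fftfreq(T)  # Fourier frequencies in [-pi, pi)
    if weight is None:
        # Illustrative default: unit weight on |w| < pi (drops the Nyquist term).
        weight = (np.abs(omega) < np.pi).astype(float)
    # For a real series, d_X1(-w) equals the complex conjugate of d_X1(w).
    cross = weight * d2 * np.conj(d1)
    taus = np.asarray(taus, dtype=float)
    q = np.array([np.real(np.sum(cross * np.exp(-1j * tau * omega))) for tau in taus])
    return taus[np.argmax(q)], q
```

A grid search is used here only for transparency; any one-dimensional optimizer could replace it once a rough maximum has been located.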


3. Estimation of the linear time delay parameters. When the source is 
moving, the time delays between sensors are time dependent. Due to the 
complexity, little research concerning a moving source has been done. The 
problem was considered in Knapp and Carter (1977) and in Schultheiss and 
Weinstein (1979). 

It was shown in Chiu (1984) that, for some pertinent cases, the time delay can 
be well approximated by a linear function. In these cases we have the approximate model

$$\begin{aligned}
X_1(t) &= S(t) + \varepsilon_1(t),\\
X_2(t) &= S(\alpha_0 + \beta_0 t) + \varepsilon_2(t),
\end{aligned} \tag{3.1}$$


and the problem of estimating the variable time delay is simplified to the
problem of estimating the parameters $\alpha_0$ and $\beta_0$. In practice, $\beta_0$ is quite close to 1
and it seems reasonable to write $\beta_0^{-1} = 1 + c_0/T$. Then we have

$$d_S^T((\alpha_0, \beta_0), \lambda) \approx d_S^T(\lambda/\beta_0)\exp(i\alpha_0\lambda), \tag{3.2}$$

where $d_S^T((\alpha, \beta), \lambda)$ is the Fourier transform of the shifted signal, $S(\alpha + \beta t)$.
Therefore, the Fourier transform of the second series can be written as


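$$\begin{aligned}
d_{X_2}^T(\lambda) &= d_S^T((\alpha_0, \beta_0), \lambda) + d_{\varepsilon_2}^T(\lambda)\\
&\approx d_{X_1}^T(\lambda/\beta_0)\exp(i\lambda\alpha_0) - d_{\varepsilon_1}^T(\lambda/\beta_0)\exp(i\lambda\alpha_0) + d_{\varepsilon_2}^T(\lambda).
\end{aligned} \tag{3.3}$$

This suggests consideration of the estimate $(\hat\alpha, \hat\beta)$ which maximizes the criterion

$$Q_T(\alpha, \beta) = \sum_{|s| < T/2} \psi(\omega_s)\, d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s/\beta)\exp(-i\omega_s\alpha). \tag{3.4}$$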


In the following, we only discuss the case of stochastic signals. Similar results 
for a deterministic signal can be found in Chiu (1984). In the next section we will 
state the assumptions under which the estimate is strongly consistent and 
asymptotically normal. 
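
To make the linear-delay criterion concrete, the following sketch evaluates it on a grid. It assumes the criterion is the real part of the sum of $d_{X_2}^T(\omega_s)\,d_{X_1}^T(-\omega_s/\beta)\exp(-i\omega_s\alpha)$ with unit weight over a chosen set of positive Fourier indices. Because $-\omega_s/\beta$ is generally not a Fourier frequency, the transform is computed by a direct sum rather than an FFT. The helper names `dft_at`, `criterion`, and `grid_search` are hypothetical.

```python
import numpy as np

def dft_at(x, freqs):
    """d_x(lam) = sum_t x[t] exp(-i lam t), evaluated at arbitrary frequencies."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    return np.exp(-1j * np.outer(np.asarray(freqs, dtype=float), t)) @ x

def criterion(x1, x2, alpha, beta, s_idx):
    """Real part of sum_s d_X2(w_s) d_X1(-w_s / beta) exp(-i alpha w_s), unit weights."""
    T = len(x1)
    omega = 2 * np.pi * np.asarray(s_idx) / T
    d2 = dft_at(x2, omega)
    d1 = dft_at(x1, -omega / beta)
    return float(np.real(np.sum(d2 * d1 * np.exp(-1j * alpha * omega))))

def grid_search(x1, x2, alphas, betas, s_idx):
    """Return the (alpha, beta) grid point with the largest criterion value."""
    q = np.array([[criterion(x1, x2, a, b, s_idx) for b in betas] for a in alphas])
    i, j = np.unravel_index(np.argmax(q), q.shape)
    return alphas[i], betas[j]
```

In practice the grid maximum would only serve as a starting value for a finer numerical maximization, since the criterion has a long ridge along which $\alpha$ and $\beta$ trade off.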




4. Assumptions and results. We first describe the model of interest in 
Assumption 1. 


ASSUMPTION 1. Let $S(t)$, $-\infty < t < \infty$, be a real-valued stationary process.
Let the observations $X_1(t)$, $X_2(t)$ have the structure

$$\begin{aligned}
X_1(t) &= S(t) + \varepsilon_1(t),\\
X_2(t) &= S(\alpha_0 + \beta_0 t) + \varepsilon_2(t).
\end{aligned} \tag{4.1}$$

Further suppose that $t = 0, \ldots, T - 1$ and $\beta_0^{-1} = 1 + c_0/T$. Also assume that $\alpha_0$
and $c_0$ are contained in the interiors of compact sets $A$ and $C$, respectively, in $\mathbb{R}$.


The requirement that $(\alpha_0, c_0)$ is contained in the interior of a compact set
seems not to be a strict one, for in usual physical situations it is clearly possible
to limit the extent of the delay. We next assume the series $\varepsilon_1(t)$, $\varepsilon_2(t)$, and $S(t)$
to be statistically independent Gaussian processes and to satisfy mixing conditions.


ASSUMPTION 2. $S(t)$, $\varepsilon_1(t)$, and $\varepsilon_2(t)$ are stationary independent Gaussian
processes with mean zero. Also suppose $\sum_u (1 + |u|)\,|c_{\varepsilon_i}(u)| < \infty$, where $c_{\varepsilon_i}(u)$, $i = 1, 2$,
are the autocovariance functions of $\varepsilon_i(t)$, $i = 1, 2$.


Under this assumption, the noises $\varepsilon_i(t)$, $i = 1, 2$, have power spectra $f_{\varepsilon_i}(\lambda)$,
$i = 1, 2$, respectively. In practice, the sampling interval usually has been chosen
so that the signal has very little spectral mass beyond $\pi$. Therefore, it seems
reasonable to assume that the signal has no spectral mass above frequency $\pi$.


ASSUMPTION 3. Let $c_S(u) = E[S(t)S(t + u)]$, $-\infty < t, u < \infty$, the autocovariance
function of $S(t)$, satisfy

$$\sum_{v=-\infty}^{\infty} (1 + |v|) \sup_{v \le u \le v+1} |c_S(u)| < \infty, \tag{4.2}$$

and $f_S(\lambda) = 0$ for $|\lambda| > \pi$, where $f_S(\lambda)$, $-\infty < \lambda < \infty$, is the power spectrum of
$S(t)$.


Because the signal in practice often has significant magnitude only in some 
frequency intervals, we would like the weighting function to satisfy the following 
conditions. 


ASSUMPTION 4. $\psi(\lambda) = I_\Omega(\lambda)\phi(\lambda)$, where $\phi(\lambda)$ is a nonnegative continuous
function of $\lambda$ with period $2\pi$, $I_\Omega(\lambda)$ is here the indicator function of a set
$\Omega \subset (-\pi, \pi)$, $\Omega$ is a finite collection of intervals, and $\psi(\lambda)$ is symmetric about 0.
Further suppose $\int \psi(\lambda) f_S(\lambda)\, d\lambda > 0$.


The condition, $\int \psi(\lambda) f_S(\lambda)\, d\lambda > 0$, requires the intervals, $\Omega$, to contain some
frequency intervals in which the signal has spectral mass (otherwise, we will not
be able to estimate the parameters).




Under Assumptions 1-4, the estimate obtained by maximizing the criterion
function $Q_T(\theta)$ of (3.4) is strongly consistent and asymptotically normal. We
state the theorems here and postpone the proofs to the last section.


THEOREM 1. Suppose Assumptions 1-4 hold. Let $\hat\theta_T = (\hat\alpha_T, \hat c_T)$ be a sequence
of estimates which maximizes $Q_T(\theta)$ of (3.4). Then $\hat\theta_T$ converges to
$\theta_0 = (\alpha_0, c_0)$ almost surely as $T \to \infty$. Here $\beta^{-1} = 1 + c/T$, $\theta = (\alpha, c)$.


THEOREM 2. Under the conditions of Theorem 1, let $\hat\theta_T = (\hat\alpha_T, \hat c_T)$ be a
sequence of estimates which maximizes $Q_T(\theta)$. Then $T^{1/2}(\hat\alpha_T - \alpha_0, \hat c_T - c_0)$ is
asymptotically normal with mean zero and covariance matrix

$$a \begin{pmatrix} 4 & 6 \\ 6 & 12 \end{pmatrix}. \tag{4.3}$$

Here $a$ equals

$$\frac{2\pi \int \lambda^2 \psi^2(\lambda) \{ f_S(\lambda)[f_{\varepsilon_1}(\lambda) + f_{\varepsilon_2}(\lambda)] + f_{\varepsilon_1}(\lambda) f_{\varepsilon_2}(\lambda) \}\, d\lambda}{\left( \int \lambda^2 \psi(\lambda) f_S(\lambda)\, d\lambda \right)^2}. \tag{4.4}$$

From the covariance matrix, we see that the correlation coefficient between $\hat\alpha$
and $\hat c$ is quite high (0.866). This was confirmed in the simulation study. It should
be noted that Figure 2 is the plot of $\hat\alpha$ and $\hat\beta$. Since $\hat\beta^{-1} = 1 + \hat c/T$, as we
assigned before, Figure 2 shows a negative correlation between $\hat\alpha$ and $\hat\beta$.
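
The quoted correlation of 0.866 can be checked with a few lines of arithmetic. The sketch below assumes that the asymptotic covariance matrix of $(\hat\alpha, \hat c)$ is proportional to the matrix with rows (4, 6) and (6, 12), which is the form consistent with the stated correlation. The variable names are illustrative only.

```python
import numpy as np

# Assumed shape of the asymptotic covariance of (alpha_hat, c_hat), up to the factor a.
cov_shape = np.array([[4.0, 6.0],
                      [6.0, 12.0]])

# The correlation does not depend on the scalar factor a.
corr = cov_shape[0, 1] / np.sqrt(cov_shape[0, 0] * cov_shape[1, 1])

# Ratio of the standard deviation of c_hat to that of alpha_hat.
sd_ratio = np.sqrt(cov_shape[1, 1] / cov_shape[0, 0])
```

Under this assumption the correlation is $6/\sqrt{48}$ and the standard deviation of $\hat c$ is $\sqrt{3}$ times that of $\hat\alpha$, whatever the spectra are.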


5. Estimation of the spectra and selection of the weighting function. 
From Theorem 2 we see that the asymptotic covariance matrix of the estimate 
depends on the spectra of the signal and the noises. In practice, however, these 
spectra are unknown and we need to estimate them. After getting the spectrum 
estimates we can estimate the covariance matrix by substituting the estimated 
spectra for the true ones. We discuss a method of estimating the spectra in this 
section. 

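We note that

$$\begin{aligned}
d_{X_2}^T(\lambda)\, d_{X_1}^T(-\lambda/\beta_0)\exp(-i\alpha_0\lambda)
&\approx d_S^T(\theta_0, \lambda)\, d_S^T(\theta_0, -\lambda) + d_S^T(\theta_0, \lambda)\, d_{\varepsilon_1}^T(-\lambda/\beta_0)\exp(-i\alpha_0\lambda)\\
&\quad + d_{\varepsilon_2}^T(\lambda)\, d_S^T(-\lambda/\beta_0)\exp(-i\alpha_0\lambda)\\
&\quad + d_{\varepsilon_2}^T(\lambda)\, d_{\varepsilon_1}^T(-\lambda/\beta_0)\exp(-i\alpha_0\lambda).
\end{aligned} \tag{5.1}$$

On the right-hand side the expected value of the first term is equal to $2\pi T f_S(\lambda)$
and the expected values of the second, third, and fourth terms are zero. So

$$d_{X_2}^T(\lambda)\, d_{X_1}^T(-\lambda/\beta_0)\exp(-i\alpha_0\lambda) \tag{5.2}$$

has expected value approximately equal to $2\pi T f_S(\lambda)$. These values are asymptotically
independent at different Fourier frequencies, $\omega_s = 2\pi s/T$. This suggests
estimating $f_S(\lambda_0)$, the spectrum of the signal at frequency $\lambda_0$, by averaging the
$d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s/\beta_0)\exp(-i\alpha_0\omega_s)$ at the Fourier frequencies near $\lambda_0$. Though we
do not know the true value of $\theta_0$, we expect that replacing $\theta_0$ with $\hat\theta$, the estimate
of $\theta_0$, will give us a useful estimate of the signal's spectrum. We establish the
following result.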


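THEOREM 3. Under Assumptions 1-4, let $\theta_T \to \theta_0$ and $M_T \to \infty$, $M_T/T \to 0$;
then $\hat f_S(\lambda_0) - f_S(\lambda_0) \to 0$, where

$$\hat f_S(\lambda_0) = \frac{1}{2\pi T M_T} \sum_{\omega_s \in I_{M_T}} d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s/\beta_T)\exp(-i\alpha_T\omega_s) \tag{5.3}$$

and $I_{M_T}$ is the set containing the $M_T$ Fourier frequencies which are closest to $\lambda_0$.
[Here $\beta_T^{-1} = 1 + c_T/T$ and $\theta_T = (\alpha_T, c_T)$.]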


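Having an estimate of the signal spectrum, we may proceed to estimate $f_{\varepsilon_1}(\lambda_0)$,
the spectrum of the noise $\varepsilon_1(t)$, by $\max(0, \hat f_{X_1}(\lambda_0) - \hat f_S(\lambda_0))$, where $\hat f_{X_1}(\lambda_0)$ is an
estimate of $f_{X_1}(\lambda_0) = f_S(\lambda_0) + f_{\varepsilon_1}(\lambda_0)$, the spectrum of $X_1(t)$. $f_{\varepsilon_2}(\lambda_0)$, the spectrum
of the noise $\varepsilon_2(t)$, can be estimated in a similar way.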
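
The averaging estimate of the signal spectrum can be sketched numerically as follows. This is an illustration under assumptions, not a definitive implementation: the normalization $1/(2\pi T M)$ is the one suggested by the expected value $2\pi T f_S(\lambda)$ quoted above, the real part of the average is taken so that the result is a real number, and the transforms are computed by direct summation. The function names are hypothetical.

```python
import numpy as np

def dft_at(x, freqs):
    """d_x(lam) = sum_t x[t] exp(-i lam t), evaluated at arbitrary frequencies."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    return np.exp(-1j * np.outer(np.asarray(freqs, dtype=float), t)) @ x

def signal_spectrum_estimate(x1, x2, alpha, beta, lam0, M):
    """Average d_X2(w) d_X1(-w/beta) exp(-i alpha w) over the M Fourier
    frequencies nearest lam0, scaled by 1 / (2 pi T M). Real part only (assumption)."""
    T = len(x1)
    s_all = np.arange(-(T // 2) + 1, T // 2)       # indices with |s| < T/2
    omega_all = 2 * np.pi * s_all / T
    nearest = np.argsort(np.abs(omega_all - lam0))[:M]
    omega = omega_all[nearest]
    terms = dft_at(x2, omega) * dft_at(x1, -omega / beta) * np.exp(-1j * alpha * omega)
    return float(np.real(np.sum(terms)) / (2 * np.pi * T * M))

def noise_spectrum_estimate(fx_hat, fs_hat):
    """Noise spectrum as the nonnegative part of (series spectrum - signal spectrum)."""
    return max(0.0, fx_hat - fs_hat)
```

With identical white-noise inputs, zero delay, and unit scale, the estimate reduces to an averaged periodogram and should be close to $1/(2\pi)$.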

We have applied a weighting function in computing the criterion function
$Q_T(\theta)$ of (3.4). We should now discuss a method for selecting a pertinent
weighting function. Some specific weighting functions have been suggested.
Knapp and Carter (1977) have reviewed some of these weighting functions. The
optimal choice of $\psi(\lambda)$ is

$$\psi(\lambda) = \frac{f_S(\lambda)}{f_S(\lambda) f_{\varepsilon_1}(\lambda) + f_S(\lambda) f_{\varepsilon_2}(\lambda) + f_{\varepsilon_1}(\lambda) f_{\varepsilon_2}(\lambda)}. \tag{5.4}$$


This was obtained by using a quasi-maximum likelihood procedure [see Hannan 
and Robinson (1973)]. This weighting function minimizes the value of a of (4.3) 
in Theorem 2 among all weighting functions. In practice, however, the spectra of 
the signal and the noises are unknown. This may cause problems when one 
substitutes the estimated values for the true ones, since when the signal to noise 
ratio is low the spectrum estimate of the signal tends to be bigger than the true 
value. This gives too much weight to that frequency. Hannan and Thomson 
(1981) assumed a finite parameter linear model, and then estimated the spectrum 
of the linear model. 
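
The effect of the weighting function on the variance constant can be explored numerically. The sketch below assumes that the optimal weight is the signal spectrum divided by $f_S f_{\varepsilon_1} + f_S f_{\varepsilon_2} + f_{\varepsilon_1} f_{\varepsilon_2}$, and that the constant $a$ is $2\pi\int\lambda^2\psi^2\{f_S(f_{\varepsilon_1}+f_{\varepsilon_2}) + f_{\varepsilon_1}f_{\varepsilon_2}\}\,d\lambda$ divided by $(\int\lambda^2\psi f_S\,d\lambda)^2$. Integrals are approximated by the trapezoid rule on a user-supplied grid. The function names are hypothetical.

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoid-rule integral of samples y over grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def optimal_weight(f_s, f_e1, f_e2):
    """psi = f_S / (f_S f_e1 + f_S f_e2 + f_e1 f_e2), set to 0 where the denominator is 0."""
    f_s = np.asarray(f_s, dtype=float)
    f_e1 = np.asarray(f_e1, dtype=float)
    f_e2 = np.asarray(f_e2, dtype=float)
    denom = f_s * f_e1 + f_s * f_e2 + f_e1 * f_e2
    return np.divide(f_s, denom, out=np.zeros_like(denom), where=denom > 0)

def variance_constant(lam, psi, f_s, f_e1, f_e2):
    """Numerical value of the constant a for a given weighting function psi."""
    g = f_s * (f_e1 + f_e2) + f_e1 * f_e2
    num = 2 * np.pi * _trapezoid(lam**2 * psi**2 * g, lam)
    den = _trapezoid(lam**2 * psi * f_s, lam) ** 2
    return num / den
```

By the Cauchy-Schwarz inequality the weight above cannot give a larger constant than any other weight on the same grid, which is a quick way to compare candidate weighting functions.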

Since the spectrum of the signal often has significant magnitude only in some 
frequency intervals, we would want the summation of (3.4) to extend only over 
these intervals. This suggests an alternative approach, namely, to discard the 
information at frequencies where the signal to noise ratio is low. So we select 
$\Omega = A \cup (-A)$, where $A$ is a finite collection of intervals in $(0, \pi)$, and use a
weighting function which can be written as $\psi(\lambda) = \phi(\lambda) I_\Omega(\lambda)$. Here $\phi(\lambda)$ is an even,
positive, and continuous function over $(-\pi, \pi)$, and $I_\Omega(\lambda)$ is the indicator
function of $\Omega$.

Intuitively, we would like to select those intervals which have high power from
the signal and low power from the noise. Under our model it is equivalent to
choosing the intervals which have high coherence. Thus, we can use coherence to
select the intervals. Brillinger (1975) discussed estimating coherence from the 
periodogram and gave the asymptotic distribution of the coherence estimate. We 
should note that this theory is for stationary series, and the model we discuss 
does not satisfy this condition. However, since $\beta_0$ is quite close to 1, we might still
get a reasonable estimate of the coherence.

After getting the estimate of the coherence, we can calculate, for example, the 
90% quantile of the null distribution of the estimate under the hypothesis that 
the coherence is zero. We shall choose those intervals inside which the coherence 
estimate is higher than that quantile. Having chosen the intervals, we select a 
weighting function, for example, by the quasi-maximum likelihood method. Then 
we compute the criterion function $Q_T(\theta)$ and estimate $\theta_0 = (\alpha_0, c_0)$ by $\hat\theta = (\hat\alpha, \hat c)$
which maximizes $Q_T(\theta)$.

Now, we reestimate the coherence. This time we substitute $d_{X_1}^T(\omega_s/\hat\beta)$ for
$d_{X_1}^T(\omega_s)$ in the estimation procedure, where as before $\hat\beta^{-1} = 1 + \hat c/T$. This should give
us an improved estimate of the coherence. We then use these estimates to choose 
intervals and the weighting function to get the final estimate of the time delay 
parameters. 
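
The interval-selection step can be sketched as follows. The code assumes a simple moving-average smoother over `m` adjacent Fourier frequencies, which is one of several possible smoothers and not necessarily the one used in this paper. The null quantile uses the standard approximation that, when the true coherence is zero, the estimated squared coherence based on `m` averaged periodogram ordinates satisfies $P(\hat R^2 \le x) = 1 - (1 - x)^{m-1}$. All function names are hypothetical.

```python
import numpy as np

def smoothed_coherence(x1, x2, m):
    """Squared coherence from periodograms smoothed over m adjacent Fourier frequencies."""
    d1 = np.fft.rfft(np.asarray(x1, dtype=float))
    d2 = np.fft.rfft(np.asarray(x2, dtype=float))
    kernel = np.ones(m) / m

    def smooth(v):
        # Moving average; values near the ends are averaged over fewer ordinates.
        return np.convolve(v, kernel, mode="same")

    f11 = smooth(np.abs(d1) ** 2)
    f22 = smooth(np.abs(d2) ** 2)
    f12 = smooth(d1 * np.conj(d2))
    return np.abs(f12) ** 2 / (f11 * f22)

def null_quantile(m, p=0.90):
    """p-quantile of the squared-coherence estimate when the true coherence is zero."""
    return 1.0 - (1.0 - p) ** (1.0 / (m - 1))

def select_frequencies(coh, m, p=0.90):
    """Boolean mask of frequencies whose estimated coherence exceeds the null quantile."""
    return coh > null_quantile(m, p)
```

The mask returned by `select_frequencies` would then be grouped into intervals by hand, keeping in mind the caveats about low-power bands discussed in the text.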

In practice some other considerations may affect the choice of the weighting 
function. The sensors may receive several signals at the same time, and different 
signals might have different frequency contents. For this case if the signal we are 
interested in has low power in some frequency intervals, then we should not 
choose those intervals even when they have high coherence. We also note that the 
estimated coherence tends to be bigger than the true value when both spectra of 
the observation series have small power. Therefore, we should be careful in 
employing the intervals which have small signal power. 


6. Simulation results and application to seismic data. In this section we 
present simulation results to evaluate the performance of the estimation proce- 
dure discussed in the previous sections. The signal used in the simulations is a 
band-limited stationary process. The frequency content is limited to the interval
$(60\pi/512, 120\pi/512)$. The signal is

$$S(t) = \frac{3}{200} \sum_{j=1}^{2000} \left[ z_1(j)\cos(\lambda_j t) + z_2(j)\sin(\lambda_j t) \right], \tag{6.1}$$

where $z_1(j)$, $z_2(j)$ are independent Gaussian random variables with means zero
and variances 1, and

$$\lambda_j = \frac{60\pi}{512} + \frac{60 j \pi}{512 \cdot 2000}. \tag{6.2}$$

We first generate the signals, $S(t)$ and $S(\alpha_0 + \beta_0 t)$, of length 512 with
$\alpha_0 = 0.25$, $\beta_0 = 1.02$. We then form the observed series by adding noise to the
signals, that is,

$$\begin{aligned}
X_1(t) &= S(t) + \varepsilon_1(t),\\
X_2(t) &= S(0.25 + 1.02t) + \varepsilon_2(t).
\end{aligned} \tag{6.3}$$


FIG. 1. Sample series of the simulation. [Horizontal axis: Time, 0 to 200.]


The noises are independent Gaussian white series with mean zero and variance 1. 
Figure 1 is the plot of a pair of sample series. We show only the first 200 points of 
each series. 
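
The simulated series can be generated along the following lines. This is a sketch under assumptions: the signal is taken to be a sum of 2000 random sinusoids with frequencies spread evenly over $(60\pi/512, 120\pi/512)$, and the amplitude factor `amp = 3/200` is an assumed normalization chosen to be consistent with the theoretical standard deviations reported for the simulation. The function names are hypothetical.

```python
import numpy as np

def make_signal(rng, n_comp=2000, amp=3 / 200):
    """Return a function S(u) that evaluates one realization of the band-limited signal."""
    j = np.arange(1, n_comp + 1)
    lam = 60 * np.pi / 512 + 60 * np.pi * j / (512 * n_comp)  # component frequencies
    z1 = rng.standard_normal(n_comp)
    z2 = rng.standard_normal(n_comp)

    def S(u):
        u = np.atleast_1d(np.asarray(u, dtype=float))
        phase = np.outer(u, lam)
        return amp * (np.cos(phase) @ z1 + np.sin(phase) @ z2)

    return S

def simulate_pair(rng, T=512, alpha0=0.25, beta0=1.02):
    """Generate X1(t) = S(t) + noise and X2(t) = S(alpha0 + beta0 t) + noise."""
    S = make_signal(rng)
    t = np.arange(T)
    x1 = S(t) + rng.standard_normal(T)                  # unit-variance white noise
    x2 = S(alpha0 + beta0 * t) + rng.standard_normal(T)
    return x1, x2
```

Because the signal is defined for real arguments, the delayed series is obtained by evaluating the same realization at the non-integer times $\alpha_0 + \beta_0 t$, with no interpolation.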

The criterion function used in the simulations is 


$$Q_T(\theta) = \operatorname{Re} \sum_{s=29} d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s/\beta)\exp(-i\alpha\omega_s), \tag{6.4}$$

where $\omega_s = 2\pi s/T$ and $\theta = (\alpha, \beta)$. Figure 2 is the plot of the estimates which
maximize $Q_T(\theta)$ above. Figures 3 and 4 are the normal probability plots of the
estimates. The result of 75 simulations is summarized in Table 1.
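
The theoretical standard deviations in Table 1 can be reproduced approximately from Theorem 2. The sketch below rests on several assumptions that are not stated explicitly in the text: the signal and noise spectra are flat, the weighting function is 1 on the band $\pm(60\pi/512, 120\pi/512)$ and 0 elsewhere, the signal variance is $2000 \times (3/200)^2 = 0.45$, the covariance matrix of $T^{1/2}(\hat\alpha - \alpha_0, \hat c - c_0)$ is $a$ times the matrix with rows (4, 6) and (6, 12), and the standard deviation of $\hat\beta$ is approximated by that of $\hat c$ divided by $T$.

```python
import numpy as np

T = 512
lo, hi = 60 * np.pi / 512, 120 * np.pi / 512

f_noise = 1.0 / (2 * np.pi)              # white noise with variance 1
signal_var = 2000 * (3 / 200) ** 2       # assumed signal variance (0.45)
f_sig = signal_var / (2 * (hi - lo))     # flat two-sided spectrum on the band

# Integral of lam^2 over the band and its mirror image.
int_lam2 = 2 * (hi**3 - lo**3) / 3

num = 2 * np.pi * int_lam2 * (f_sig * 2 * f_noise + f_noise**2)
den = (int_lam2 * f_sig) ** 2
a = num / den

sd_alpha = np.sqrt(4 * a / T)            # from the (1,1) entry, 4a
sd_c = np.sqrt(12 * a / T)               # from the (2,2) entry, 12a
sd_beta = sd_c / T                       # first-order approximation, beta close to 1
```

Under these assumptions the computed values agree with the theoretical column of Table 1 to about three significant figures.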

The estimation procedure was applied in analyzing the accelerograms from the 
El Centro differential array in Imperial Valley. The data are records from the 
October 15, 1979, Imperial Valley earthquake. This earthquake has been studied 
extensively. The information about the array can be found in Bycroft (1982). 

We analyze the records of stations 1 and 4, which are 420 feet apart. Station 4 
is on the north of station 1. We take 512 time points (5.12 seconds), beginning at 
23:17:05.15, from each of the north-south components of stations 1 and 4. 

In order to examine the frequency content of the signal and choose a proper 
weighting function, $\psi(\lambda)$, we first estimate the power spectra and the cross-spectrum.
We estimate the spectra and the cross-spectrum by smoothing the periodograms
at the Fourier frequencies $\omega_s = 2\pi s/T$, that is, we smooth the functions

$$\begin{aligned}
I_{11}(\omega_s) &= d_{X_1}^T(\omega_s)\, d_{X_1}^T(-\omega_s)/2\pi T,\\
I_{22}(\omega_s) &= d_{X_2}^T(\omega_s)\, d_{X_2}^T(-\omega_s)/2\pi T,\\
I_{21}(\omega_s) &= d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s)/2\pi T.
\end{aligned} \tag{6.5}$$

FIG. 2. Estimates of the 75 simulations. [Horizontal axis: β, 1.016 to 1.023.]

Here $d_{X_1}^T(\lambda)$, $d_{X_2}^T(\lambda)$ are the Fourier transforms of the data. From the estimated
spectra, we see that most of the frequency content lies below 0.1 (10 Hz) for both
series. Figure 5 is the plot of the estimated coherence. It suggests that the
frequency contents below 0.07 (7 Hz) of these two series are highly correlated.
The criterion function we used in deriving the estimates is the real part of

$$\sum_{s=1} d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s/\beta)\exp(-i\omega_s\alpha). \tag{6.6}$$


The maximum of $Q_T(\theta)$ is located at $(-0.10, 1.0163)$. Now we reestimate the
cross-spectrum, this time smoothing the modified periodograms

$$I'_{21}(\omega_s) = d_{X_2}^T(\omega_s)\, d_{X_1}^T(-\omega_s/1.0163)/2\pi T. \tag{6.7}$$

As before $\omega_s = 2\pi s/T$ and $T = 512$.

FIG. 3. Normal probability plot of the 75 estimates of $\alpha_0 = 0.25$ from the simulations. [Horizontal axis: standard deviation, -3 to 3.]

FIG. 4. Normal probability plot of the 75 estimates of $\beta_0 = 1.02$ from the simulations. [Horizontal axis: standard deviation, -3 to 3.]

TABLE 1
Simulation Results

                      Theoretical    Sample
Mean (α̂)              0.25           0.349
S.D. (α̂)              0.352          0.377
Mean (β̂)              1.02           1.0197
S.D. (β̂)              0.00119        0.00123
Corr. (α̂, β̂)          -0.866         -0.338

FIG. 5. Estimated coherence of $X_1(t)$ with $X_2(t)$. [Horizontal axis: λ/2π, frequency in cycles per 1/100 sec, 0.0 to 0.5.]

FIG. 6. Modified coherence of $X_1(t)$ with $X_2(t)$. [Vertical axis: coherence; horizontal axis: λ/2π, frequency in cycles per 1/100 sec, 0.0 to 0.5.]

Figure 6 is the plot of the modified coherence. This plot shows correlation only 
for frequencies below 0.07 (7 Hz). In comparing with Figure 5, we see that the 
estimated coherence has risen from 0.83 to 0.95 for frequencies near 0.03 (3 Hz). 
This gives evidence of the existence of the moving source in the data. Because the 
modified coherence does not suggest that frequency components above 0.07 are 
correlated, we will not change the weighting function. The final estimates are 
those above. We then estimate the variances of the estimates by substituting the 
estimated spectra for the true ones in Theorem 2. The variances of $\hat\alpha$ and $\hat\beta$ are
estimated at 0.0608 and $0.696 \times 10^{-6}$, respectively. Thus the estimate of $\alpha_0$ is
$-0.10$ with a standard deviation of 0.25 and the estimate of $\beta_0$ is 1.0163 with a
standard deviation of 0.0008.

This set of data has also been analyzed by Spudich and Cranswick (1984). 
They estimated the variable time delays between the stations. The method they 
used is to move a time window of fixed length (the length is 127 sample points in 
their analysis) along the series. For each time window, a constant time delay
estimate was obtained by finding the location of the maximum of the cross- 
covariance function of the filtered series. (This method is equivalent to the 
method we discussed in Section 2.) Then, the time delay estimates were plotted 
as a function of time. Figure 7, taken from Spudich and Cranswick (1984), shows 
the results of the north-south components between station 1 and station 3. A 
variable time delay can be observed from the figure. The vertical unit in the 
figure is slowness; the slowness is obtained by dividing the time delay (in seconds) 
by the distance (in km) between the stations. The period of the data we analyzed 
is from 5.8 seconds to 10.9 seconds, and the major part of the seismic waves 


FIG. 7. Moving-window cross-correlation of the north-south component accelerograms from station
1 and station 3. The vertical line indicates the theoretical arrival time of the S wave from the
hypocenter. The upper trace shows the peak value of correlation as a function of time, the middle
trace is the north-south accelerogram, and the lowest is the slowness as a function of time through
the record. Also indicated are the width and weighting function of the moving window [from Spudich
and Cranswick (1984)].


arrived in this period. In Figure 7, the slowness changed 0.6 (from 0.3 to $-0.3$) in
this period. This result is quite consistent with ours. From our estimate of $\beta_0$ we
find the slowness changed 0.65. Since the common time of these records was lost
and different time origins were used, we cannot compare the value of $\hat\alpha$ with the
estimate they obtained at 5.8 seconds.
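
The slowness change of 0.65 can be recovered by simple arithmetic. The sketch below assumes that the change in time delay over the analyzed window is $(\hat\beta - 1)$ times the window length of 5.12 seconds, and that the slowness change is this delay change divided by the 420-foot separation of stations 1 and 4 expressed in kilometers. The variable names are illustrative only.

```python
FEET_TO_KM = 0.3048e-3

beta_hat = 1.0163          # estimated scale parameter
window_seconds = 5.12      # 512 samples at 100 samples per second
separation_km = 420 * FEET_TO_KM

# Change in the time delay accumulated across the window, in seconds.
delay_change = (beta_hat - 1.0) * window_seconds

# Slowness is time delay divided by distance, in seconds per kilometer.
slowness_change = delay_change / separation_km
```

This gives a change of about 0.65 seconds per kilometer, in line with the value quoted in the text.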

For the situation of variable time delays, the moving-window method has two main
disadvantages: (1) The constant time delay estimates are unstable when the
window length is small. (2) When the window length is big, the time delay cannot
be approximated by a constant time delay model; therefore, it will be difficult to
find the statistical properties of the estimates. In this case a linear function, or a
piecewise linear function, provides a better approximation to the variable time 
delay function. 


7. Proofs. We proceed to prove the results in this section. The following
lemma allows us to replace $d_S^T(\omega_s/\beta)\exp(i\alpha\omega_s)$ by $d_S^T(\theta, \omega_s)$. This makes it easier
to calculate various cumulants.




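LEMMA 1. Let $S(t)$, $-\infty < t < \infty$, satisfy Assumption 3 and $S_i(t) =
(t/T)^i S(t)$, $i = 0, 1$. Then, for any bounded function $g(\omega)$ and $i, j = 0, 1$,

$$\begin{aligned}
&E\Big[ \sum_{|s| < T/2} g(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(-\omega_s/\beta_2)\exp(-i\alpha_2\omega_s) \Big]\\
&\qquad = E\Big[ \sum_{|s| < T/2} g(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(\theta_2, -\omega_s) \Big] + O(T \log T)
\end{aligned} \tag{7.1}$$

uniformly on $\Theta \times \Theta$. Here $\omega_s = 2\pi s/T$, $\beta_i^{-1} = 1 + c_i/T$, and $\theta_i = (\alpha_i, c_i)$.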
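PROOF OF LEMMA 1. For any $\lambda_1, \lambda_2 \in (-\pi, \pi)$,

$$\begin{aligned}
E\{ d_{S_i}^T(\theta_1, \lambda_1)\, d_{S_j}^T(\lambda_2) \}
&= \sum_{t_1} \sum_{t_2} (t_1/T)^i (t_2/T)^j c_S(\alpha_1 + \beta_1 t_1 - t_2)\exp(-i\lambda_1 t_1)\exp(-i\lambda_2 t_2)\\
&= \sum_{t_1} \sum_{t_2} (t_1/T)^i (t_2/T)^j \exp(-i\lambda_1 t_1)\exp(-i\lambda_2 t_2) \int_{-\pi}^{\pi} \exp(i\eta(\alpha_1 + \beta_1 t_1 - t_2)) f_S(\eta)\, d\eta\\
&= \int_{-\pi}^{\pi} H_i^T(\lambda_1 - \eta\beta_1)\, H_j^T(\lambda_2 + \eta)\exp(i\eta\alpha_1) f_S(\eta)\, d\eta,
\end{aligned} \tag{7.2}$$

where $H_i^T(\lambda) = \sum_{t=0}^{T-1} (t/T)^i \exp(-i\lambda t)$, $i = 0, 1, 2$. So

$$\begin{aligned}
&E\Big[ \sum_{|s| < T/2} g(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(-\omega_s/\beta_2)\exp(-i\alpha_2\omega_s) \Big]\\
&\quad = \sum_{|s| < T/2} g(\omega_s) \int_{-\pi}^{\pi} H_i^T(\omega_s - \eta\beta_1)\, H_j^T(-\omega_s/\beta_2 + \eta)\exp(i\eta\alpha_1)\exp(-i\alpha_2\omega_s) f_S(\eta)\, d\eta\\
&\quad = \sum_{|s| < T/2} g(\omega_s) \int_{-\pi}^{\pi} H_i^T(\omega_s - \eta\beta_1)\, H_j^T(-\omega_s + \eta\beta_2)\exp(i\eta\alpha_1)\exp(-i\alpha_2\eta) f_S(\eta)\, d\eta\\
&\qquad + O\Big( \int_{-\pi}^{\pi} \sum_{|s| < T/2} |H_i^T(\omega_s - \eta\beta_1)|\, d\eta \Big).
\end{aligned} \tag{7.3}$$

The last equality holds because $H_j^T(\beta\lambda) = H_j^T(\lambda) + O(1)$ and $\exp(i\alpha\omega) =
\exp(i\alpha\eta) + O(|\omega - \eta|)$. Note that the first term of (7.3) is equal to

$$E\Big[ \sum_{|s| < T/2} g(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(\theta_2, -\omega_s) \Big], \tag{7.4}$$

and the second term is of order $T \log T$. This completes the proof of the lemma. □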


Analogous to the proof of Theorem 4.5.1 of Brillinger (1975), we have the 
following lemma. The details of the proof can be found in Chiu (1984). 




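LEMMA 2. Let $\varepsilon_1(t)$, $\varepsilon_2(t)$ satisfy Assumption 2 and let $\psi(\lambda)$ satisfy Assumption 4.
Then

$$\lim_{T \to \infty} T^{-2} \sum_{|s| < T/2} \psi(\omega_s)\, d_{\varepsilon_2}^T(\omega_s)\, d_{\varepsilon_1}^T(-\omega_s/\beta)\exp(-i\alpha\omega_s) = 0 \quad \text{a.s.} \tag{7.5}$$

and uniformly on $(\alpha, c) \in \Theta = A \times C$ ($A$, $C$ compact sets in $\mathbb{R}$). Here $\beta^{-1} = 1 +
c/T$ and $\omega_s = 2\pi s/T$, $s = 0, \pm 1, \ldots$.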


In proving the strong consistency of the estimate and deriving its asymptotic
distribution, we need to calculate the limits of

$$T^{-2} \sum_{|s| < T/2} \psi(\omega_s)\, d_{S_i}^T(\theta_0, \omega_s)\, d_{S_j}^T(-\omega_s/\beta)\exp(-i\alpha\omega_s),$$

where $S_i(t) = (t/T)^i S(t)$ for $i = 0, 1, 2$. Lemma 3 gives us the values of these
limits.


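LEMMA 3. Let $S(t)$ satisfy Assumption 3 and $\psi(\lambda)$ satisfy Assumption 4.
Let $S_i(t) = (t/T)^i S(t)$, $i = 0, 1, 2$. Then, uniformly on $\Theta \times \Theta$, and for $i, j =
0, 1, 2$,

$$\lim_{T \to \infty} T^{-2} \sum_{|s| < T/2} \psi(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(-\omega_s/\beta_2)\exp(-i\alpha_2\omega_s) = Q_{ij}(\theta_1, \theta_2) \tag{7.6}$$

almost surely. Here $\omega_s = 2\pi s/T$, $\beta_i^{-1} = 1 + c_i/T$, $\theta_i = (\alpha_i, c_i)$, and

$$Q_{ij}(\theta_1, \theta_2) = \int_{-\pi}^{\pi} \psi(\eta)\exp(i\eta(\alpha_1 - \alpha_2)) f_S(\eta) \int_0^1 \exp[i\eta(c_1 - c_2)x]\, x^{i+j}\, dx\, d\eta. \tag{7.7}$$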

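PROOF OF LEMMA 3. By using the same arguments as in the lemma of
Hannan and Robinson (1973), one can show that

$$T^{-2} \sum_{|s| < T/2} \psi(\omega_s)\, H_i^T(\omega_s - \beta_1\eta)\, H_j^T(-\omega_s + \beta_2\eta) \tag{7.8}$$

converges to

$$\psi(\eta) \int_0^1 x^{i+j}\exp(i\eta(c_1 - c_2)x)\, dx. \tag{7.9}$$

Here $H_i^T(\lambda) = \sum_{t=0}^{T-1} (t/T)^i \exp(-i\lambda t)$, $i = 0, 1, 2$, as defined in the proof of Lemma
1. From this and Lemma 1, we have

$$\lim_{T \to \infty} T^{-2} E\Big[ \sum_{|s| < T/2} \psi(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(\theta_2, -\omega_s) \Big] = Q_{ij}(\theta_1, \theta_2). \tag{7.10}$$

Then, similar to the proof of Lemma 2, one can show that

$$T^{-2} \sum_{|s| < T/2} \psi(\omega_s)\, d_{S_i}^T(\theta_1, \omega_s)\, d_{S_j}^T(\theta_2, -\omega_s) \tag{7.11}$$

converges, uniformly, to $Q_{ij}(\theta_1, \theta_2)$ almost surely. □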


Next we prove Theorem 1 concerning the strong consistency of the estimate. 


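PROOF OF THEOREM 1. Note that $Q_T(\theta)$ can be separated into four terms,
namely,

$$\begin{aligned}
T^{-2} Q_T(\theta) &= T^{-2} \sum \psi(\omega_s)\, d_S^T(\theta_0, \omega_s)\, d_S^T(-\omega_s/\beta)\exp(-i\omega_s\alpha)\\
&\quad + T^{-2} \sum \psi(\omega_s)\, d_{\varepsilon_2}^T(\omega_s)\, d_S^T(-\omega_s/\beta)\exp(-i\omega_s\alpha)\\
&\quad + T^{-2} \sum \psi(\omega_s)\, d_S^T(\theta_0, \omega_s)\, d_{\varepsilon_1}^T(-\omega_s/\beta)\exp(-i\omega_s\alpha)\\
&\quad + T^{-2} \sum \psi(\omega_s)\, d_{\varepsilon_2}^T(\omega_s)\, d_{\varepsilon_1}^T(-\omega_s/\beta)\exp(-i\omega_s\alpha).
\end{aligned} \tag{7.12}$$

From Lemma 2 the second, third, and fourth terms converge to 0, uniformly on $\Theta$
and almost surely. Further, from Lemma 3 the first term converges uniformly to

$$Q(\theta) = \int_{-\pi}^{\pi} \exp(i\eta(\alpha_0 - \alpha))\, \psi(\eta) f_S(\eta)\, \frac{\exp(i\eta(c_0 - c)) - 1}{i\eta(c_0 - c)}\, d\eta. \tag{7.13}$$

Because $\psi(\eta)$ and $f_S(\eta)$ are symmetric functions, $Q(\theta)$ is a real-valued function.

The theorem can be proven by following a classical argument [Jennrich
(1969)], if we can show that $Q(\theta)$ has a unique maximum at $\theta = \theta_0$.

Since $|\exp(i\eta(c_0 - c)) - 1|$ is the distance between 1 and $\exp(i\eta(c_0 - c))$, and
$|\eta(c_0 - c)|$ is the length of the arc between 1 and $\exp(i\eta(c_0 - c))$ on the unit
circle, $Q(\theta)$ will not attain the maximum when $c \ne c_0$.

Consider

$$\begin{aligned}
Q((\alpha, c_0)) &= \int_{-\pi}^{\pi} \exp(i\eta(\alpha_0 - \alpha))\, \psi(\eta) f_S(\eta)\, d\eta\\
&= 2 \int_0^{\pi} \cos(\eta(\alpha_0 - \alpha))\, \psi(\eta) f_S(\eta)\, d\eta.
\end{aligned} \tag{7.14}$$

If $\alpha_0 \ne \alpha$, then we can find an interval $J$ contained in the set $\{\omega: \psi(\omega) f_S(\omega) > 0\}$
such that $\cos(\eta(\alpha_0 - \alpha)) < 1$ for $\eta \in J$. Therefore the only maximum is located
at $\theta_0$. □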
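
The chord-versus-arc step in the proof above can be written out explicitly. The lines below are a sketch of that step, with $u$ standing for $\eta(c_0 - c)$ and with the limit $Q(\theta)$ assumed to have the form of a weighted integral of $\exp(i\eta(\alpha_0 - \alpha))\,(e^{iu} - 1)/(iu)$. The strict inequality uses the condition of Assumption 4 that $\psi f_S$ is positive on a set of positive measure.

```latex
% Chord length versus arc length on the unit circle, with u = \eta (c_0 - c):
\left| e^{iu} - 1 \right| = 2 \left| \sin(u/2) \right| < |u| \quad (u \neq 0),
\qquad \text{so} \qquad
\left| \frac{e^{iu} - 1}{iu} \right| < 1 .

% Consequently, for c \neq c_0:
|Q(\theta)|
  \le \int \psi(\eta) f_S(\eta)
      \left| \frac{e^{i\eta(c_0 - c)} - 1}{i\eta(c_0 - c)} \right| d\eta
  < \int \psi(\eta) f_S(\eta)\, d\eta = Q(\theta_0).
```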


Next, we prove Theorem 2 concerning the asymptotic distribution of the 
estimate. 


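PROOF OF THEOREM 2. When $T$ is large enough, the estimate $\hat\theta$ will be in the
interior of $\Theta$, and $\partial Q_T(\hat\theta)/\partial\alpha = \partial Q_T(\hat\theta)/\partial c = 0$. Therefore, we can find points
$\theta'$ and $\theta''$ between $\hat\theta$ and $\theta_0$, such that

$$T^{-3/2} \begin{pmatrix} \partial Q_T(\theta_0)/\partial\alpha \\ \partial Q_T(\theta_0)/\partial c \end{pmatrix}
= -T^{-2} \begin{pmatrix} \partial^2 Q_T(\theta')/\partial\alpha\,\partial\alpha & \partial^2 Q_T(\theta')/\partial\alpha\,\partial c \\ \partial^2 Q_T(\theta'')/\partial c\,\partial\alpha & \partial^2 Q_T(\theta'')/\partial c\,\partial c \end{pmatrix}
T^{1/2} \begin{pmatrix} \hat\alpha - \alpha_0 \\ \hat c - c_0 \end{pmatrix}. \tag{7.15}$$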



Firstly, we should find the asymptotic distributions of 


g 8 
(7.16) Sita plea dh hod; o/B)erpl- ia) 
a ls|< 7/2 
and of 
g G 2 
an Sy tad, )09(0,)4(—0,/B exp\ — a, ) 
C Ij <T/2 
Here 
T-1 t 
(7.18) x(A) = 1 palel 22), 


f= 


and as before 8™' = 1+ c/T. [We define dA) and d if similarly. | 
From Lemma 1 we find that the expected values of 


(7.19) = ww (ow, di Gy, w, dg — 0,/By )exp( — 14,09, ) 
|s| 7/2 

and 

(7.20) 2 wb (w, )at( by, w, )d?( = w,/ By )exp( a La, ) 
Js[< 7/2 


are of order T log T. Therefore 


a ð 
Qr( 2) ~ lim TBE 
da 


Tox 


d 8 
(7.21) lim T75( a = () 
T= x dc 


In order to derive the covariance matrix of T~*/*(dQ,(6,)/da) and 
T °/"(dQ7(0,)/dc), we need to compute the values of the cumulants. 


cum| Die, o(w, aN byt Oy), 
¥ iw, pl, Jd7(0, @,)47( Gy, -o)) 


cum X E OD 0), 0, JA 8, — 0), 


(7.22) 
Z iw p(o, dt Ay, w, dal Gy, -2| ’ 


cum Die ow, di Gy, w, dd A, -w,), 


Vio, ( w, dt Gy, &, )dz( Gy, — =) . 


It 1s easy to see that the first one and the second one are of order 0(T"). The 




third one is equal to 
LIR a Aa wo, thws Yla, cum|d?(6 0? w, )d5( 9, - Wg, J» 


d Tbo» o, Jdal, —w,,)| 


=E E - wo plo, )¥(,, )cum[d?(4), w), d3(8, —,,)| 
(7.23) * $a 


x cum| dJ( 6, —w,,),d 7(4 1 ws) 


T 2; 2 gi w, 0, Yon) Y0, )oum|d7(4,, ws, ),47( 4, w, )| 


x cum| d2( 4, —o,), a3 (4%, ~o,,)| : 


The second term of (7.23) is equal to (2a T3/3 ST AVP OANA) + O(T log T). 
The first term of (7.23) is equal to 


2; D a AT’w w, plon lO) 
(7.24) T 
xX Hy (o, si Ws, Hio, E wg fa (Oa) T O(T*). 


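Because

$$H_1^T(\lambda) = \sum_{t=0}^{T-1} (t/T)\exp(-i\lambda t)
= \frac{\exp(-i\lambda) - \exp(-i\lambda T)}{T[1 - \exp(-i\lambda)]^2} - \frac{(T - 1)\exp(-i\lambda T)}{T(1 - \exp(-i\lambda))}, \tag{7.25}$$

which is equal to $iT/2\pi s$ at $\lambda = \omega_s = 2\pi s/T$, $s \ne 0$, expression (7.24) is equal to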
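
The closed form for the weighted sum of exponentials can be checked numerically. The sketch below compares a direct evaluation of $\sum_{t=0}^{T-1}(t/T)\exp(-i\lambda t)$ with the closed-form expression, and then evaluates it at a Fourier frequency, where the exact value is $-1/(1 - \exp(-i\omega_s))$ and differs from $iT/2\pi s$ only by a bounded term. The function names are hypothetical.

```python
import numpy as np

def H1_direct(lam, T):
    """Direct evaluation of sum_{t=0}^{T-1} (t/T) exp(-i lam t)."""
    t = np.arange(T)
    return np.sum((t / T) * np.exp(-1j * lam * t))

def H1_closed(lam, T):
    """Closed form: (z - z^T) / (T (1 - z)^2) - (T - 1) z^T / (T (1 - z)), z = exp(-i lam)."""
    z = np.exp(-1j * lam)
    zT = np.exp(-1j * lam * T)
    return (z - zT) / (T * (1 - z) ** 2) - (T - 1) * zT / (T * (1 - z))
```

At a nonzero Fourier frequency the leading term grows like $T/s$, so the bounded remainder is negligible in the cumulant calculations for small $s$.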
4 
>, -rwo )i2(o,)+ EY AT 520) + OCT?) 
$ sHs (81 — 82) 
(7.26) = a ar = AAIE) dd + 0(T3) 
_ 2a T? 





J IOANE) dA + of). 
The first equality of (7.26) holds since $\sum_{s=1}^{\infty} 1/s^2 = \pi^2/6$.
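
The closed form of the weighted exponential sum used above, its leading term at a Fourier frequency, and the series identity $\sum 1/s^2 = \pi^2/6$ can be checked numerically. The following is a minimal sketch in plain Python; the function names `h_direct` and `h_closed` and the values of `T` and `s` are illustrative choices, not taken from the paper. At a Fourier frequency the exact value differs from $iT/(2\pi s)$ by a bounded remainder, which is what the second printed quantity shows.

```python
import cmath
import math

def h_direct(lam, T):
    """Direct evaluation of sum_{t=0}^{T-1} (t/T) * exp(-i*lam*t)."""
    return sum((t / T) * cmath.exp(-1j * lam * t) for t in range(T))

def h_closed(lam, T):
    """Closed form of the same sum (requires exp(-i*lam) != 1)."""
    z = cmath.exp(-1j * lam)
    zT = cmath.exp(-1j * lam * T)
    return (z - zT) / (T * (1 - z) ** 2) - (T - 1) * zT / (T * (1 - z))

T = 1000                                # illustrative sample size
s = 3                                   # illustrative frequency index, s != 0
omega_s = 2 * math.pi * s / T           # Fourier frequency
direct = h_direct(omega_s, T)
closed = h_closed(omega_s, T)
leading = 1j * T / (2 * math.pi * s)    # leading term iT/(2*pi*s)

# Partial sum of 1/k^2, which approaches pi^2/6
partial = sum(1.0 / k ** 2 for k in range(1, 100001))

print(abs(direct - closed))    # agreement of the direct sum and the closed form
print(abs(direct - leading))   # bounded remainder relative to the leading term
print(partial, math.pi ** 2 / 6)
```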


It may now be seen that the covariance matrix of (dQ7(9))/da, 0Q7(9))/dc) 
converges to 


(7.27) p 





with 
p= 2r f RPO f(A) + f,(A)] ad 
+2m f NYP(A) f(A) f(A) AA. 
From Lemma 3 it may be seen that if 0p —> 0) as T > œ, then T~? 0°Q,(67-)/da?, 


T? 0°Q,(97)/dc?, and T7? 0°Q,(07)/da dc converge, almost surely, to —d, 
d/2, and —d/3, respectively, with 


(7.28) 


(7.29) d= f YA) f(A) aA. 
Note that 
~~ | 
L al 4 6 
7.20 : = , 
mn = i 6 k 








giving the covariance matrix indicated. 

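Next, since the $k$th-order cumulants of $T^{-3/2}(\partial Q_T(\theta_0)/\partial a, \partial Q_T(\theta_0)/\partial c)$ are
of order $T^{1-k/2}(\log T)^k$, these converge to zero for $k > 2$. Therefore,
$T^{-3/2}(\partial Q_T(\theta_0)/\partial a, \partial Q_T(\theta_0)/\partial c)$ is asymptotically normal and the proof is finished. □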


We next prove Theorem 3. 


THEOREM 3. Under Assumptions 1 ~4, let 0r @, and Mr > œ, M,/T >Q. 
Then e oe = 0, where 


(7.81) f,(Ao) = aie. 2 d3 (0)dx (~ &,/Br Jexp(—iare,) 


and Iu, is the set containing r Fourier frequencies which are closest to Xo. 
[Here Br! = 1 + ¢c,/T and br = (arp, cr).] 


PROOF OF THEOREM 3. First we note that, as in the proof of Lemma 1, 
E [az Go, W, )al( = w./Br)| 


= E[d?(w,/By)d7(—0,/Br)|exp(iage,) + O(1). 
Now, we calculate the expected value of d7(w,/fy)d7(—w,/B,). It is 


(7.32) 


(7.33) zanr > — SEJRA) + O(1), 
where 

(7.34)  $\Delta^T(\lambda) = \sum_{t=0}^{T-1} \exp(-i\lambda t) = \frac{1 - \exp(-i\lambda T)}{1 - \exp(-i\lambda)}.$



Since A’(w,/By — w,/ Br) = T + o(T), we get 
(7.35) E(d7(w,/B))d7(-—e/Br)) = InTf(A») + o(T). 
We therefore have 

1 1 


on MT 2 2rTf (ào) +0(T)| = f (ào) + o(1). 


w E iy 


(7.36) E( f.(Aq)) = 


Next we want to find the variance of f(A). We first show that the covariance of 
dt wi / By )di(—o, /Br) and at w, / By) Ai -w, /Br) for Ooi w., = Im, and w., # 
w., can be neglected. This can be seen from the following argument. We note that 


cum(d"( w, /By)d7(—«,,/Br), 47(—«,,/By)d7( 0, ,/Br)) 
(7.37) = curn(d?( w, / Bo)» di(- w, /Br )curn( d?( w, /Br); dt wo, /By )) 
or curn(d7( w, / Bo ) ; d”( w, /By))cum(d7( 7 w, /Br), at = w, / Br ) ). 


The second term is bounded. For the first term, we have 





cum(d?{ w, /By), ar -, /Br))} = pman me JA) + O(1) 
(7.38) By Br 
= o| (o, — w) '). 


Hence we have 

cum(d7( 8, w, )AT( = w, /Br) dlto 0, AC, /Br)) 
= oflo, =, T 

Then, it can be seen that the variance of f,(A,) is 


(7.40) MEAD EAn) + FLA o) + fod] + Ao) hAon). 
This finishes the proof of the theorem. □


(7.39) 


Acknowledgments. This paper is part of the author’s doctoral dissertation 
which was written at the University of California, Berkeley under the supervi- 
sion of Professor David R. Brillinger. Professor Brillinger’s guidance, encour- 
agement, and support are gratefully acknowledged. The author would also 
like to thank Professors Bruce Bolt and Kjell Doksum for their valuable 
comments. The author also appreciates the helpful suggestions given by Drs. 
Norman Abrahamson, Ross Ihaka, and George Terrell. The valuable comments 
from the associate editor and the referees are gratefully appreciated. Finally, the 
author would like to thank Dr. Paul Spudich for providing the seismic data. 


REFERENCES 


AKI, K. (1967). Scaling law of seismic spectrum. J. Geophys. Res. 73 1217-1231.

BOLT, B. A. (1981). Interpretation of strong ground motion records. Miscellaneous Paper S-73-1,
Report 17, U.S. Army Engineers Waterways Experiment Station, Vicksburg, Miss.




BOORE, D. M. (1977). The motion of the ground in earthquakes. Sci. Amer. 237 68-78.

BRILLINGER, D. R. (1975). Time Series: Data Analysis and Theory. Holt, Rinehart, and Winston,
New York.

BYCROFT, G. N. (1982). El Centro differential ground motion array. In The Imperial Valley,
California, Earthquake of October 15, 1979. U.S. Geol. Surv. Prof. Pap. 1254 351-356.

CARTER, G. C. (1981). Time delay estimation for passive sonar signal processing. IEEE Trans.
Acoust. Speech Signal Process. ASSP-29 463-470.

CHIU, S.-T. (1984). Statistical estimation of the parameters of a moving source from array data.
Ph.D. dissertation, Univ. California, Berkeley.

HANNAN, E. J. (1975). Measuring the velocity of a signal. In Perspectives in Probability and
Statistics (J. Gani, ed.) 227-237. Academic, New York.

HANNAN, E. J. and ROBINSON, P. M. (1973). Lagged regression with unknown lags. J. Roy. Statist.
Soc. Ser. B 35 252-267.

HANNAN, E. J. and THOMSON, P. J. (1981). Delay estimation and the estimation of coherence and
phase. IEEE Trans. Acoust. Speech Signal Process. ASSP-29 485-490.

JENNRICH, R. I. (1969). Asymptotic properties of non-linear least squares estimators. Ann. Math.
Statist. 40 633-643.

KNAPP, C. H. and CARTER, G. C. (1977). Estimation of time delay in the presence of source or
receiver motion. J. Acoust. Soc. Amer. 61 1545-1549.

REID, H. F. (1910). Mechanics of the earthquake. In The California Earthquake of April 18, 1906 2.
Carnegie Inst. of Washington, D.C.

SCHULTHEISS, P. M. and WEINSTEIN, E. (1979). Estimation of differential Doppler shifts. J. Acoust.
Soc. Amer. 68 1412-1419.

SPUDICH, P. and CRANSWICK, E. (1984). Direct observation of rupture propagation during the 1979
Imperial Valley earthquake using a short baseline accelerometer array. Bull. Seism. Soc.
Amer. 74 2083-2114.

THOMSON, P. J. (1982). Signal estimation using an array of recorders. Stochastic Process. Appl. 13
201-214.


DEPARTMENT OF MATHEMATICAL SCIENCES 
RICE UNIVERSITY 
HOUSTON, TEXAS 77251 


The Annals of Statistics
1986, Vol. 14, No. 2, 578-589


ESTIMATION FOR A SEMIMARTINGALE REGRESSION
MODEL USING THE METHOD OF SIEVES¹


By IAN W. MCKEAGUE
Florida State University 


Estimation by the method of sieves for a semimartingale regression 
model introduced by Aalen (1980) is studied. It is of interest to estimate 
functions which describe the influence of the covariates over time. An estima- 
tor for these functions is introduced and conditions which ensure consistency 
of the estimator in $L^2$-norm are given. Applications to diffusion processes and
point processes with censored data are also discussed. 


1. Introduction. The method of sieves (Grenander, 1981) has proved to be a 
powerful technique in nonparametric estimation. It has recently been applied to 
stochastic processes for the estimation of such time dependent functions as the 
mean of a translate of the Wiener process (Grenander, 1981; Geman and Hwang, 
1982), the drift coefficient of a linear diffusion process (Nguyen and Pham 1982), 
the hazard function in the multiplicative intensity model for point processes 
(Karr, 1983), and the mean of a Gaussian process (Antoniadis, 1985). 

In the present paper we study estimation by the method of sieves for the 
following semimartingale regression model which was introduced by Aalen (1980). 
It contains diffusion processes and the multiplicative intensity model for point 
processes as important examples. Suppose that n subjects and p covariates for 
each subject are observed over the time interval $[0,1]$. Let $X_i(t)$ denote the state
of the $i$th subject at time $t$, and suppose that $X = (X_1, \ldots, X_n)'$ satisfies


(1.1)  $X(t) = X(0) + \int_0^t Y(s)\alpha(s)\,ds + M(t), \quad t \in [0,1],$


where $\alpha = (\alpha_1, \ldots, \alpha_p)'$ is a vector of unknown nonrandom functions, $Y = (Y_{ij})$ is
the $n \times p$ matrix of covariate processes, with $Y_{ij}$ being the $j$th covariate for the
$i$th subject, and $M = (M_1, \ldots, M_n)'$ where each $M_i$ is a square integrable
martingale.

It is of interest to estimate the functions $\alpha_1, \ldots, \alpha_p$ and so provide detailed
information on changes in the influence of the covariates over time. In
Section 2 we introduce an estimator $\hat\alpha^{(n)}$ for $\alpha$ and state conditions under which
$\int_0^1 [\hat\alpha_j^{(n)}(t) - \alpha_j(t)]^2\,dt \to 0$ in probability as $n \to \infty$, for $j = 1, \ldots, p$. Our approach
is based on the sieve method developed for linear diffusion processes by
Nguyen and Pham (1982) and is similar to the well known orthogonal series
technique for nonparametric density estimation, first used by Čencov (1962). We


Received October 1984; revised July 1985. 

¹ Research supported by the U.S. Army Research Office under Grant DAAG 29-82-K-0168.

AMS 1980 subject classifications. Primary 62M09, 62G05; secondary 60G44.

Key words and phrases. Inference for stochastic processes, method of sieves, regression analysis, 
semimartingales, point process, diffusion process, censoring. 




use an increasing sequence of finite dimensional subspaces of $L^2[0,1]$ and define
the estimator $\hat\alpha^{(n)}$ to be an element of the $n$th subspace. The dimension $d_n$ of the
$n$th subspace is allowed to tend to infinity at the rate $d_n = o(n)$ as $n \to \infty$. This
improves on the rate d, = o(n'/*) given by Nguyen and Pham (1982). Various 
moment conditions ((A1)-(A5), Section 2) are imposed on the $p$ covariate
processes. These conditions are easily satisfied, unless $p \ge 3$, in which case
condition (A4) becomes more severe as $p$ increases. Several examples of our
model (1.1) are discussed in Section 3; proofs are contained in Section 4.


2. Estimation of $\alpha$. We begin by stating some technical assumptions needed
in the semimartingale regression model (1.1). $(\Omega, \mathcal{F}, P)$ will denote a complete
probability space and for each $i = 1, \ldots, n$, $(\mathcal{F}_{it}, t \in [0,1])$ is a nondecreasing
right-continuous family of sub-σ-fields of $\mathcal{F}$ where $\mathcal{F}_{i0}$ contains all $P$-null sets in
$\mathcal{F}$. All processes are indexed by $t \in [0,1]$. Each process $(M_i(t), \mathcal{F}_{it})$, $i = 1, \ldots, n$,
is assumed to be a square integrable martingale such that almost all paths of $M_i$
are right-continuous on $[0,1)$ with left limits on $(0,1]$. The predictable variation
of $M_i$ is the unique increasing, $(\mathcal{F}_{it})$ predictable process $\langle M_i \rangle_t$ such that $\langle M_i \rangle_0 =
M_i^2(0)$ and $M_i^2 - \langle M_i \rangle$ is a martingale; refer to Meyer (1976).

The covariate process $Y_{ij}$ is assumed to be $(\mathcal{F}_{it})$ predictable, that is measurable
with respect to the σ-field on $[0,1] \times \Omega$ generated by all left-continuous,
$(\mathcal{F}_{it})$ adapted processes. The σ-field $\mathcal{F}_{it}$ represents the state and covariate
history of the $i$th subject up to time $t$. It is assumed that $(X_n, n \ge 1)$, $(M_n, n \ge 1)$,
and $(Y_{nj}, n \ge 1)$ for $j = 1, \ldots, p$ are strictly stationary sequences. In particular,
this will be the case if the subjects are iid.

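The method of sieves consists in taking an estimator from an increasing
sequence of sets of functions indexed by the sample size. For each $j = 1, \ldots, p$,
let $(\phi_{jr}, r \ge 1)$ be a complete orthonormal sequence in $L^2[0,1]$. Define the
estimator $\hat\alpha_j^{(n)}$ of $\alpha_j$ to be the element of span$\{\phi_{jr}, r = 1, \ldots, d_n\}$ given by


(2.1)  $\hat\alpha_j^{(n)}(t) = \sum_{r=1}^{d_n} \hat\alpha_{jr}^{(n)} \phi_{jr}(t),$


where $(d_n)$ is an increasing sequence of positive integers, the $p \times d_n$ matrix
$\hat\alpha^{(n)} = (\hat\alpha_{jr}^{(n)})$ satisfies


(2.2)  $\mathrm{vec}(\hat\alpha^{(n)}) = A^{(n)-1}\,\mathrm{vec}(B^{(n)}),$


where the vec operator takes a matrix and places the elements in lexicographical
order to form a large column vector, $B^{(n)}$ is the $p \times d_n$ matrix given by


(2.3)  $B_{jr}^{(n)} = n^{-1} \sum_{i=1}^{n} \int_0^1 \phi_{jr}(t) Y_{ij}(t)\,dX_i(t),$


and $A^{(n)}$ is the $pd_n \times pd_n$ matrix partitioned into $p^2$ submatrices $A_{jk}^{(n)}$ of order
$d_n \times d_n$ with


(2.4)  $A_{jrkl}^{(n)} = n^{-1} \sum_{i=1}^{n} \int_0^1 \phi_{jr}(t) \phi_{kl}(t) Y_{ij}(t) Y_{ik}(t)\,dt.$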




In (2.2), $A^{(n)-1}$ denotes a generalized inverse of $A^{(n)}$. The choice of generalized
inverse here does not affect any of our results since, by the proof of Lemma 4.2,


$P(A^{(n)} \text{ is invertible}) \to 1$, as $n \to \infty$.


In the case of diffusion processes, $\hat\alpha^{(n)}$ can be derived as a restricted maximum
likelihood estimator; see Nguyen and Pham (1982). However, for dependent
observations and arbitrary square integrable martingales no such interpretation
is available. A rationale for using $\hat\alpha^{(n)}$ comes from the following result which
establishes the $L^2$ consistency of the estimator under some assumptions (A1)-(A6)
stated after the theorem.
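
To make the construction in (2.1)-(2.4) concrete, here is a minimal sketch for a single covariate ($p = 1$) on a discrete time grid. Everything specific in it is an assumption made for illustration: the cosine basis (indexed from 0), the toy model $dX = \alpha(t) X\,dt + dW$ with $Y_i(t) = X_i(t)$, the Euler discretization, and the helper names `phi`, `solve`, `sieve_coefficients` and `simulate`. Integrals are replaced by left-endpoint Riemann sums, so this is a discretized analogue of the estimator rather than the estimator itself.

```python
import math
import random

def phi(r, t):
    """Cosine orthonormal basis of L^2[0, 1]; r = 0, 1, 2, ... (assumed choice)."""
    return 1.0 if r == 0 else math.sqrt(2.0) * math.cos(r * math.pi * t)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [a[i][:] + [b[i]] for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, m):
            f = aug[i][col] / aug[col][col]
            for j in range(col, m + 1):
                aug[i][j] -= f * aug[col][j]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (aug[i][m] - tail) / aug[i][i]
    return x

def sieve_coefficients(x_paths, y_paths, d, dt):
    """Discretized analogue of (2.2)-(2.4) for p = 1.

    x_paths[i][k] and y_paths[i][k] hold X_i and Y_i at time k*dt.
    Returns the d coefficients of the estimate in the basis phi.
    """
    n = len(x_paths)
    steps = len(x_paths[0]) - 1
    A = [[0.0] * d for _ in range(d)]
    B = [0.0] * d
    for i in range(n):
        for k in range(steps):
            basis = [phi(r, k * dt) for r in range(d)]
            y = y_paths[i][k]
            dx = x_paths[i][k + 1] - x_paths[i][k]
            for r in range(d):
                B[r] += basis[r] * y * dx / n
                for l in range(d):
                    A[r][l] += basis[r] * basis[l] * y * y * dt / n
    return solve(A, B)

def simulate(n, steps, alpha, rng):
    """Euler scheme for the toy model dX = alpha(t) X dt + dW with X_0 = 1."""
    dt = 1.0 / steps
    paths = []
    for _ in range(n):
        x = [1.0]
        for k in range(steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            x.append(x[-1] + alpha(k * dt) * x[-1] * dt + dw)
        paths.append(x)
    return paths, dt

rng = random.Random(0)
true_alpha = lambda t: 0.5 * phi(0, t) + 0.3 * phi(1, t)   # lies in the span for d = 2
x_paths, dt = simulate(n=400, steps=100, alpha=true_alpha, rng=rng)
coef = sieve_coefficients(x_paths, x_paths, d=2, dt=dt)    # covariate Y_i(t) = X_i(t)
print(coef)   # estimated coefficients, to be compared with 0.5 and 0.3
```

Because the true function is chosen inside the two-dimensional sieve, the only error in this toy run is the martingale noise term, which shrinks as the number of simulated subjects grows.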


THEOREM 2.1. Under (A1)-(A6), for any $d_n \uparrow \infty$ such that $d_n = o(n)$,


$\int_0^1 [\hat\alpha_j^{(n)}(t) - \alpha_j(t)]^2\,dt \to_P 0$, as $n \to \infty$, for $j = 1, \ldots, p$.


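ASSUMPTIONS.
(A1) $\int_0^1 \alpha_j^2(t)\,dt < \infty$, for $j = 1, \ldots, p$.
(A2) $\sup_{t \in [0,1]} E Y_{1j}^4(t) < \infty$, for $j = 1, \ldots, p$.
(A3) $\inf_{t \in [0,1]} E Y_{1j}^2(t) > 0$, for $j = 1, \ldots, p$.
(A4) $\sup_{t \in [0,1]} \frac{[E Y_{1j}(t) Y_{1k}(t)]^2}{E Y_{1j}^2(t)\, E Y_{1k}^2(t)} < (p - 1)^{-2}$,
for all $1 \le j < k \le p$, applicable for $p \ge 2$.
(A5) The function

$\nu_j(t) = E \int_0^t Y_{1j}^2(s)\,d\langle M_1 \rangle_s, \quad t \in [0,1],$

is absolutely continuous with bounded derivative (Lebesgue a.e.) for $j = 1, \ldots, p$.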
(A6) For 1 < k < l< œ, let ¥, denote the o-algebra generated by {¥,,: k <i < 
l}, and denote 
p(n) =sup sup |P(BIA) — P(B)|. 
k21 AEJ}, P(A)>0 
Sse 
Assume that the following g-mixing condition holds 
2 p(n) < œ. 


neil 




REMARKS. 


(i) Assumptions (A3) and (A4) can be regarded as identifiability criteria. It is
easy to construct examples which violate each of these assumptions and for
which $\alpha$ is nonidentifiable. Note that the expression inside the supremum in
(A4) is bounded above by 1 (Cauchy-Schwarz inequality) so that for $p = 2$
(A4) is a very weak requirement. As $p$ increases it quickly becomes a rather
severe condition on the covariates.

(ii) The fourth moment assumption (A2) is also required in the analysis of some
Cox-type regression models; see Prentice and Self (1983, page 812).

(iii) Assumptions (A1), (A2), (A5), and stationarity ensure the existence of the
integrals in (2.3) and (2.4). The martingale integral $\int_0^1 \phi_{jr}(t) Y_{ij}(t)\,dM_i(t)$ is
defined since

$E\left[\int_0^1 \phi_{jr}(t) Y_{ij}(t)\,dM_i(t)\right]^2 = E \int_0^1 \phi_{jr}^2(t) Y_{ij}^2(t)\,d\langle M_i \rangle_t = \int_0^1 \phi_{jr}^2(t)\,d\nu_j(t) < \infty.$

(iv) Examples of processes satisfying the assumptions (A1)-(A6) are described in
Section 3. In the important cases of diffusion processes and point processes
(A5) is a consequence of (A2).
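
Remark (i) can be illustrated with a small computation. The sketch below assumes that the expression inside the supremum in (A4) is the squared cross-moment of two covariates divided by the product of their second moments (the same quantity that defines $\delta$ in Section 4), and replaces expectations by sample means at each time point supplied. The function names `a4_ratio` and `satisfies_a4` and the toy data are hypothetical.

```python
def a4_ratio(samples):
    """samples[i][j] = value of the j-th covariate for subject i at one fixed time.

    Returns the largest value over j < k of
    (mean Y_j Y_k)^2 / (mean Y_j^2 * mean Y_k^2),
    an empirical stand-in for the expression inside the supremum in (A4).
    """
    n = len(samples)
    p = len(samples[0])
    second = [[sum(row[j] * row[k] for row in samples) / n for k in range(p)]
              for j in range(p)]
    worst = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            worst = max(worst, second[j][k] ** 2 / (second[j][j] * second[k][k]))
    return worst

def satisfies_a4(samples_by_time):
    """Compare the empirical ratio with (p - 1)^(-2) at every time point supplied."""
    p = len(samples_by_time[0][0])
    bound = (p - 1) ** -2
    return all(a4_ratio(s) < bound for s in samples_by_time)

# Toy data: two time points, three subjects, p = 2 covariates.
data = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 1.0], [1.0, 2.0], [1.0, 1.0]],
]
print(a4_ratio(data[0]))
print(satisfies_a4(data))

# Three perfectly collinear covariates: the ratio is 1, the bound is 1/4.
collinear = [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]
print(satisfies_a4(collinear))
```

For $p = 2$ the bound is 1, so only exact collinearity fails; for $p = 3$ the bound drops to 1/4, which shows how quickly the requirement tightens.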


3. Examples.


3.1. Diffusion processes. Let $\alpha(t)$, $t \in [0,1]$ be a continuous function and
$b(x)$, $\sigma(t, x)$, $t \in [0,1]$, and $x \in \mathbb{R}$ satisfy the following Lipschitz and growth
conditions:


(C1) $|b(x) - b(y)|^2 + |\sigma(t, x) - \sigma(t, y)|^2 \le K|x - y|^2$,
(C2) $b^2(x) + \sigma^2(t, x) \le K(1 + x^2)$,


where $K$ is a constant. Let $W = (W_t, \mathcal{F}_t)$ be a Wiener process and $\eta$ an
$\mathcal{F}_0$-measurable random variable. Under these conditions the stochastic differential
equation


(3.1)  $dX_t = \alpha(t) b(X_t)\,dt + \sigma(t, X_t)\,dW_t, \quad t \in [0,1], \quad X_0 = \eta,$

has a unique solution $X = (X_t, \mathcal{F}_t)$. If the function $b$ is known and $n$ iid copies
of $X$ are observed, then (3.1) can be expressed in the form of the semimartingale
regression model (1.1) where $M_t = \int_0^t \sigma(s, X_s)\,dW_s$ and $Y(t) = b(X_t)$. Conditions
(C1), (C2) and the following additional conditions (C3) and (C4) are sufficient for
Theorem 2.1 to be applicable.

(C3) $E\eta^4 < \infty$,

(C4) $b(X_t)$ vanishes a.s. for no $t \in [0,1]$.


Assumptions (A2) and (A5) can be checked using (C3) from which a result of
Liptser and Shiryayev (1973, Theorem 4.6) gives $\sup_{t \in [0,1]} E X_t^4 < \infty$. (A5) then
follows from (C2) and the fact that $\langle M \rangle_t = \int_0^t \sigma^2(s, X_s)\,ds$. (A3) follows from
(C4), the a.s. path continuity of $X$ and the continuity of $b$.

Special cases of (3.1) were treated by Grenander (1981) and Geman and Hwang
(1982) who considered the case $b(x) = 1$, $\sigma(t, x) = 1$, and Nguyen and Pham
(1982) who took $b(x) = x$, $\sigma(t, x) = 1$.


3.2. Point processes. Let $N = (N(t), \mathcal{F}_t)$ be a point process with intensity

(3.2)  $\lambda(t) = \sum_{j=1}^{p} \alpha_j(t) Y_j(t),$


where $\alpha_j$ is an unknown, continuous, nonnegative function and $Y_j$ is an observable,
nonnegative, $(\mathcal{F}_t)$-predictable process, $j = 1, \ldots, p$. Assuming that
$EN(1) < \infty$, there is a square integrable martingale $(M_t, \mathcal{F}_t)$ such that

(3.3)  $N_t = \int_0^t \lambda(s)\,ds + M_t, \quad t \in [0,1],$


(see Aalen (1978)) and this is also a form of the semimartingale regression model
(1.1). Assumption (A5) is a consequence of (A2) in this case since $\langle M \rangle_t = \int_0^t \lambda(s)\,ds$.

A practical example of this model might arise in which $\lambda(t)$ is the hazard rate
for the incidence of cancer in a subject who at age $t$ has had a cumulative
exposure $Y_j(t)$ to each of $j = 1, \ldots, p$ carcinogens and for whom $N$ is the point
process with a single jump at the time of initial detection of cancer. $\lambda$ is set to
zero after cancer is detected. The functions $\alpha_1, \ldots, \alpha_p$ in this example represent
the change in the relative hazard rates for the $p$ carcinogens with age.

The model (3.2) was introduced by Aalen (1978, 1980) as an alternative to the
proportional-hazard regression model of Cox (1972). Aalen provided an estimator
for the cumulative hazard function $\int_0^t \alpha_j(s)\,ds$ rather than $\alpha_j$ itself. For the case
$p = 1$, Ramlau-Hansen (1983) has used kernel function methods from density
estimation and Karr (1983) has used the method of sieves to obtain estimators of
$\alpha_1$. It is not clear that these two approaches can be extended to $p > 1$.


3.3. Processes with both diffusion process and point process components. Let
$\beta(t)$, $t \in [0,1]$ be a continuous function, $N = (N(t), \mathcal{F}_t)$ the point process of
Section 3.2, $b(x)$, $\sigma(t, x)$, $\eta$ as in Section 3.1, and $\varepsilon > 0$. Then the equation


(3.4)  $X_t = \eta + \int_0^t \beta(s) b(X_s)\,ds + \int_0^t \sigma(s, X_s)\,dW_s + \varepsilon N_t$


has a unique solution $X = (X_t)$ which behaves as a diffusion process between the
jump times of the point process. The size of the (positive) jumps of $X$ is given by
$\varepsilon$, which is assumed to be known. By substituting (3.3) into (3.4) we obtain
another example of the semimartingale regression model (1.1) from which the
functions $\beta, \alpha_1, \ldots, \alpha_p$ can be estimated.


3.4. Censoring. In many practical situations the available data have been 
randomly censored. The possibility of censoring is easily incorporated into the 
semimartingale regression model (1.1) as follows. Suppose that the state $X_i$ and
covariates $Y_{ij}$ of the $i$th subject are observable only up to an $(\mathcal{F}_{it})$ stopping time
$\tau_i$ and that $(\tau_i, i \ge 1)$ is a stationary sequence of random variables. Define new
state and covariate processes (which are observable over the whole of $[0,1]$) by
the stopped processes $\tilde X_i(t) = X_i(t \wedge \tau_i)$ and $\tilde Y_{ij}(t) = Y_{ij}(t \wedge \tau_i)$, respectively.
Equivalently, $I(t \le \tau_i) Y_{ij}(t)$ could be used in place of $\tilde Y_{ij}(t)$. Also define
a new square integrable martingale, $\tilde M_i(t) = M_i(t \wedge \tau_i)$. The censored version of
the model is formed by replacing $X$, $Y$, and $M$ in (1.1) by $\tilde X$, $\tilde Y$, and $\tilde M$,
respectively. The assumptions of Theorem 2.1 should now be checked for the
stopped processes $\tilde X$, $\tilde Y$, and $\tilde M$.

For $\tilde M_i$ to satisfy (A5) it is necessary that $P(\tau_i > t) > 0$, for all $0 \le t < 1$.
This follows from the fact that $\langle \tilde M_i \rangle_t = \langle M_i \rangle_{\tau_i}$ on $\{\omega: t \ge \tau_i\}$, $P$ a.s. In some
applications it is reasonable to assume that the censoring is independent of the
subject (i.e. $\tau_i$ is independent of $X_i$, $Y_{ij}$, and $M_i$). In this case, by using the
covariate $I(t \le \tau_i) Y_{ij}(t)$, for which the quantity $P(\tau_i \ge t)$ factors out of
expressions in (A2)-(A5), it suffices to check (A2)-(A5) for the unstopped processes
and have $P(\tau_i = 1) > 0$.
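
On a discrete time grid the two covariate constructions used for censored data can be written in a few lines. This is only an illustrative sketch: the function names `stop_path` and `at_risk_covariate`, the grid, and the convention that a value observed exactly at the censoring time still counts as observed are assumptions, not part of the paper.

```python
def stop_path(path, times, tau):
    """Stopped process: value at time min(t, tau) on a grid.

    path[k] is the value at times[k]; tau is the censoring time.
    After tau the path is frozen at the last value observed at or before tau.
    """
    stopped = []
    last = path[0]
    for value, t in zip(path, times):
        if t <= tau:
            last = value
        stopped.append(last)
    return stopped

def at_risk_covariate(path, times, tau):
    """Indicator-weighted covariate: equal to Y(t) up to tau and zero afterwards."""
    return [value if t <= tau else 0.0 for value, t in zip(path, times)]

times = [0.0, 0.25, 0.5, 0.75, 1.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
print(stop_path(y, times, tau=0.5))          # [1.0, 2.0, 3.0, 3.0, 3.0]
print(at_risk_covariate(y, times, tau=0.5))  # [1.0, 2.0, 3.0, 0.0, 0.0]
```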

Estimation for an example of the censored semimartingale regression model 
arising in neurophysiology is discussed by Habib and McKeague (1985). 


4. Proofs. The measures $\mu_j$, $j = 1, \ldots, p$ defined below play an important
role in the proof of Theorem 2.1. Define $\mu_j$ by $d\mu_j(t) = E Y_{1j}^2(t)\,dt$. Under
assumptions (A2) and (A3) we have $L^2([0,1], dt) = L^2([0,1], d\mu_j)$ as sets and the
norms are equivalent. As in Nguyen and Pham (1982), there exists a complete
orthonormal sequence $(\psi_{jr}, r \ge 1)$ in $L^2([0,1], d\mu_j)$ such that


span$\{\psi_{jr}, r = 1, \ldots, d_n\}$ = span$\{\phi_{jr}, r = 1, \ldots, d_n\}$.


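The coordinates of $\alpha_j$ and $\hat\alpha_j^{(n)}$ in the basis $(\psi_{jr}, r \ge 1)$ are denoted $\xi_{jr}$, $r \ge 1$ and
$\hat\xi_{jr}^{(n)}$, $r = 1, \ldots, d_n$ respectively. Let $\hat\xi_j^{(n)} = (\hat\xi_{jr}^{(n)}, r = 1, \ldots, d_n)'$ and $\xi_j^{(n)} = (\xi_{jr}, r =
1, \ldots, d_n)'$. It is clear that to establish Theorem 2.1 it suffices to show that


(4.1)  $|\hat\xi_j^{(n)} - \xi_j^{(n)}| \to_P 0$, as $n \to \infty$.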
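By (2.2) the $p \times d_n$ matrix $\hat\xi^{(n)} = (\hat\xi_{jr}^{(n)})$ satisfies


(4.2)  $\mathrm{vec}(\hat\xi^{(n)}) = a^{(n)-1}\,\mathrm{vec}(b^{(n)}),$

where $b^{(n)}$ is the $p \times d_n$ matrix given by

(4.3)  $b_{jr}^{(n)} = n^{-1} \sum_{i=1}^{n} \int_0^1 \psi_{jr}(t) Y_{ij}(t)\,dX_i(t),$


$a^{(n)}$ is the $pd_n \times pd_n$ matrix partitioned into $p^2$ submatrices $a_{jk}^{(n)}$ of order
$d_n \times d_n$ with


(4.4)  $a_{jrkl}^{(n)} = n^{-1} \sum_{i=1}^{n} \int_0^1 \psi_{jr}(t) \psi_{kl}(t) Y_{ij}(t) Y_{ik}(t)\,dt,$


and $a^{(n)-1}$ is a generalized inverse of $a^{(n)}$. Let $\xi^{(n)}$ denote the $p \times d_n$ matrix
with elements $\xi_{jr}$ and $\alpha_j^{(n)}$ the projection of $\alpha_j$ onto span$\{\psi_{jr}, r = 1, \ldots, d_n\}$, so
that $\alpha_j^{(n)} = \sum_{r=1}^{d_n} \xi_{jr} \psi_{jr}$. Using (1.1) to expand (4.3) it is easily checked that

(4.5)  $\mathrm{vec}(\hat\xi^{(n)} - \xi^{(n)}) = a^{(n)-1}\,\mathrm{vec}\,c^{(n)},$

where $c^{(n)}$ is the $p \times d_n$ matrix with

(4.6)  $c_{jr}^{(n)} = n^{-1} \sum_{i=1}^{n} \sum_{k=1}^{p} \int_0^1 \psi_{jr}(t) Y_{ij}(t) Y_{ik}(t) [\alpha_k(t) - \alpha_k^{(n)}(t)]\,dt + n^{-1} \sum_{i=1}^{n} \int_0^1 \psi_{jr}(t) Y_{ij}(t)\,dM_i(t).$


By (4.1) and (4.5) we have that Theorem 2.1 follows from the next two results
which hold under the conditions of the theorem.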


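LEMMA 4.1. $\|\mathrm{vec}\,c^{(n)}\| \to_P 0$, as $n \to \infty$.


LEMMA 4.2. $\{\|a^{(n)-1}\|, n \ge 1\}$ is a tight sequence of random variables ($\|\cdot\|$
denotes operator norm).


The following elementary inequality is used in the proof of Lemmas 4.1 and
4.2.


LEMMA 4.3. Let $(Z_t, t \in [0,1])$ be a measurable stochastic process such that
$K = \sup_{t \in [0,1]} E Z_t^2 < \infty$. Then for any integrable function $h$,


$E\left|\int_0^1 h_t Z_t\,dt\right|^2 \le K \left(\int_0^1 |h_t|\,dt\right)^2.$


PROOF.


$E\left|\int_0^1 h_t Z_t\,dt\right|^2 = \int_0^1\!\!\int_0^1 h_s h_t E[Z_s Z_t]\,ds\,dt$

$\le \int_0^1\!\!\int_0^1 |h_s||h_t| [E Z_s^2]^{1/2} [E Z_t^2]^{1/2}\,ds\,dt$

$\le K \left(\int_0^1 |h_t|\,dt\right)^2.$ □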
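
The inequality of Lemma 4.3 can be checked on simulated data. In the sketch below expectations are replaced by sample means and integrals by Riemann sums; the same Cauchy-Schwarz argument then applies to the empirical measure, so the empirical inequality holds up to rounding. The toy process, the function `h`, and the name `lemma_4_3_sides` are assumptions made only for this illustration.

```python
import math
import random

def lemma_4_3_sides(z_paths, h, dt):
    """Empirical versions of both sides of the inequality in Lemma 4.3.

    z_paths[i][k] is the i-th sampled path of Z at grid point k,
    h[k] is the function h on the same grid, dt is the grid step.
    Expectations become sample means, integrals become Riemann sums.
    """
    n = len(z_paths)
    m = len(h)
    lhs = sum(sum(h[k] * path[k] for k in range(m)) ** 2 for path in z_paths) * dt ** 2 / n
    k_hat = max(sum(path[k] ** 2 for path in z_paths) / n for k in range(m))
    rhs = k_hat * (sum(abs(v) for v in h) * dt) ** 2
    return lhs, rhs

rng = random.Random(1)
m = 50
dt = 1.0 / m
grid = [k * dt for k in range(m)]

# Assumed toy process: Z_t = xi_1 * cos(2*pi*t) + xi_2 * t with Gaussian xi's.
z_paths = []
for _ in range(200):
    xi1, xi2 = rng.gauss(0, 1), rng.gauss(0, 1)
    z_paths.append([xi1 * math.cos(2 * math.pi * t) + xi2 * t for t in grid])

h = [math.sin(3 * t) - 0.5 for t in grid]
lhs, rhs = lemma_4_3_sides(z_paths, h, dt)
print(lhs, rhs, lhs <= rhs)
```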


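PROOF OF LEMMA 4.1. From (4.6) we can write

$c^{(n)} = \gamma^{(n)} + \eta^{(n)} + \rho^{(n)},$

where

$\gamma_{jr}^{(n)} = n^{-1} \sum_{i=1}^{n} \sum_{k=1}^{p} (y_{jrik}^{(n)} - E y_{jrik}^{(n)}),$

$y_{jrik}^{(n)} = \int_0^1 \psi_{jr}(t) Y_{ij}(t) Y_{ik}(t) [\alpha_k(t) - \alpha_k^{(n)}(t)]\,dt,$

$\eta_{jr}^{(n)} = \sum_{k=1}^{p} E y_{jr1k}^{(n)},$

$\rho_{jr}^{(n)} = n^{-1} \sum_{i=1}^{n} \int_0^1 \psi_{jr}(t) Y_{ij}(t)\,dM_i(t).$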




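Using (A3) and since $(\psi_{jr}, r = 1, \ldots, d_n)$ are orthonormal in $L^2([0,1], d\mu_j)$, we
have

$\|\mathrm{vec}\,\eta^{(n)}\|^2 \le p \sum_{k=1}^{p} \sum_{j=1}^{p} \sum_{r=1}^{d_n} \{E y_{jr1k}^{(n)}\}^2$

$= p \sum_{k=1}^{p} \sum_{j=1}^{p} \sum_{r=1}^{d_n} \left[\int_0^1 \psi_{jr}(t) \frac{E Y_{1j}(t) Y_{1k}(t)}{E Y_{1j}^2(t)} (\alpha_k(t) - \alpha_k^{(n)}(t)) E Y_{1j}^2(t)\,dt\right]^2$

$\le p \sum_{k=1}^{p} \sum_{j=1}^{p} \int_0^1 \left[\frac{E Y_{1j}(t) Y_{1k}(t)}{E Y_{1j}^2(t)}\right]^2 (\alpha_k(t) - \alpha_k^{(n)}(t))^2 E Y_{1j}^2(t)\,dt$

(by Bessel's inequality)

$\le p^2 \sum_{k=1}^{p} \int_0^1 (\alpha_k(t) - \alpha_k^{(n)}(t))^2 E Y_{1k}^2(t)\,dt$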

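by the Cauchy-Schwarz inequality. It follows that $\|\mathrm{vec}\,\eta^{(n)}\| \to 0$, as $n \to \infty$,
since $\alpha_k^{(n)} \to \alpha_k$ in $L^2([0,1], \mu_k)$ for each $k = 1, \ldots, p$. Next, by stationarity of
$(Y_i, i \ge 1)$,

(4.7)  $E(\gamma_{jr}^{(n)})^2 \le p \sum_{k=1}^{p} \left[n^{-1}\,\mathrm{var}(y_{jr1k}^{(n)}) + 2n^{-2} \sum_{i=2}^{n} (n - i + 1)\,\mathrm{cov}(y_{jr1k}^{(n)}, y_{jrik}^{(n)})\right].$

By Lemma 4.3 and (A2) there are constants $K_1$, $K_2$ such that

$E(y_{jr1k}^{(n)})^2 \le K_1 \left(\int_0^1 |\psi_{jr}(t)| |\alpha_k(t) - \alpha_k^{(n)}(t)|\,dt\right)^2 \le K_2 \int_0^1 [\alpha_k(t) - \alpha_k^{(n)}(t)]^2\,dt \to 0,$

as $n \to \infty$. By a result of Ibragimov and Linnik (1971, Lemma 17.2.3),

$\mathrm{cov}(y_{jr1k}^{(n)}, y_{jrik}^{(n)}) \le 2\varphi^{1/2}(i - 1) E(y_{jr1k}^{(n)})^2$

so that from (4.7) and (A6), $E(\gamma_{jr}^{(n)})^2 = o(n^{-1})$, uniformly in $j$ and $r$, as $n \to \infty$.
Thus, since $d_n = o(n)$, we have

$E\|\mathrm{vec}\,\gamma^{(n)}\|^2 = \sum_{j=1}^{p} \sum_{r=1}^{d_n} E(\gamma_{jr}^{(n)})^2 = o(1).$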


Finally consider $\rho^{(n)}$. Using a property of stochastic integrals with respect to
square integrable martingales, we have

$E\left[\int_0^1 \psi_{jr}(t) Y_{1j}(t)\,dM_1(t)\right]^2 = E \int_0^1 \psi_{jr}^2(t) Y_{1j}^2(t)\,d\langle M_1 \rangle_t = \int_0^1 \psi_{jr}^2(t)\,d\nu_j(t),$

which is uniformly bounded in $j$ and $r$ by (A5). Then, using the mixing condition
(A6) as before, it can be checked that $E\|\mathrm{vec}\,\rho^{(n)}\|^2 = o(1)$. Collecting terms we
obtain

$E\|\mathrm{vec}\,c^{(n)}\|^2 \le 3\left(E\|\mathrm{vec}\,\gamma^{(n)}\|^2 + \|\mathrm{vec}\,\eta^{(n)}\|^2 + E\|\mathrm{vec}\,\rho^{(n)}\|^2\right) \to 0$, as $n \to \infty$. □


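PROOF OF LEMMA 4.2. We can write $a^{(n)} = \rho^{(n)} + \Gamma^{(n)}$ where $\rho^{(n)}$ and $\Gamma^{(n)}$
are partitioned in the same way as $a^{(n)}$,

$\rho_{jrkl}^{(n)} = n^{-1} \sum_{i=1}^{n} (\beta_{ijrkl}^{(n)} - E\beta_{ijrkl}^{(n)}),$

$\beta_{ijrkl}^{(n)} = \int_0^1 \psi_{jr}(t) \psi_{kl}(t) Y_{ij}(t) Y_{ik}(t)\,dt,$

$\Gamma_{jrkl}^{(n)} = E\beta_{1jrkl}^{(n)}.$

Using a similar argument to the estimation of $E(\gamma_{jr}^{(n)})^2$ in the proof of Lemma
4.1, there is a constant $K$ such that

$E(\rho_{jrkl}^{(n)})^2 \le \frac{K}{n} E(\beta_{1jrkl}^{(n)})^2.$

Thus,

$E\|\rho^{(n)}\|^2 \le \sum_{j,k=1}^{p} \sum_{r,l=1}^{d_n} E(\rho_{jrkl}^{(n)})^2$

$\le \frac{K}{n} \sum_{j,k=1}^{p} \sum_{r,l=1}^{d_n} E(\beta_{1jrkl}^{(n)})^2$

$= \frac{K}{n} \sum_{j,k=1}^{p} \sum_{r,l=1}^{d_n} E\left(\int_0^1 \psi_{jr}(t) \psi_{kl}(t) Y_{1j}(t) Y_{1k}(t)\,dt\right)^2$

$= \frac{K}{n} \sum_{j,k=1}^{p} \sum_{l=1}^{d_n} E \sum_{r=1}^{d_n} \left[\int_0^1 \psi_{jr}(t) \frac{\psi_{kl}(t) Y_{1j}(t) Y_{1k}(t)}{E Y_{1j}^2(t)} E Y_{1j}^2(t)\,dt\right]^2$

$\le \frac{K}{n} \sum_{j,k=1}^{p} \sum_{l=1}^{d_n} E \int_0^1 \left[\frac{\psi_{kl}(t) Y_{1j}(t) Y_{1k}(t)}{E Y_{1j}^2(t)}\right]^2 E Y_{1j}^2(t)\,dt$

(by Bessel's inequality)

$= O(d_n / n)$

by (A2), (A3), and the Cauchy-Schwarz inequality. It follows that

(4.8)  $E\|\rho^{(n)}\|^2 = o(1)$, as $n \to \infty$.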




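Now we consider the behavior of $\Gamma^{(n)}$ as $n \to \infty$. First note that the diagonal
submatrices of $\Gamma^{(n)}$ are identity matrices. Thus, letting $I^{(n)}$ denote the $pd_n \times pd_n$
identity matrix, we have directly from the definition of the operator norm that

$\|\Gamma^{(n)} - I^{(n)}\|^2 = \sup_{\|\mathrm{vec}\,x^{(n)}\| \le 1} \sum_{j=1}^{p} \sum_{r=1}^{d_n} \left[\sum_{k \ne j} \sum_{l=1}^{d_n} x_{kl} \Gamma_{jrkl}^{(n)}\right]^2$

(4.9)  $= \sup_{\|\mathrm{vec}\,x^{(n)}\| \le 1} \sum_{j=1}^{p} \sum_{r=1}^{d_n} \left[\int_0^1 \psi_{jr}(t) \sum_{k \ne j} \sum_{l=1}^{d_n} x_{kl} \psi_{kl}(t)\, E Y_{1j}(t) Y_{1k}(t)\,dt\right]^2,$

where the supremum is over the set of $p \times d_n$ matrices $x^{(n)}$ with $\|\mathrm{vec}\,x^{(n)}\| \le 1$.
Let $\|\cdot\|_k$ denote the norm in $L^2([0,1], d\mu_k)$ and

$H = \left\{h = (h_1, \ldots, h_p): h_k \in L^2([0,1], d\mu_k) \text{ for } k = 1, \ldots, p \text{ and } \sum_{k=1}^{p} \|h_k\|_k^2 \le 1\right\}.$

Then it follows from (4.9) that

$\|\Gamma^{(n)} - I^{(n)}\|^2 \le \sup_{h \in H} \sum_{j=1}^{p} \sum_{r=1}^{d_n} \left[\int_0^1 \psi_{jr}(t) \sum_{k \ne j} h_k(t) \frac{E Y_{1j}(t) Y_{1k}(t)}{E Y_{1j}^2(t)}\,d\mu_j(t)\right]^2$

$\le \sup_{h \in H} \sum_{j=1}^{p} \int_0^1 \left[\sum_{k \ne j} h_k(t) \frac{E Y_{1j}(t) Y_{1k}(t)}{E Y_{1j}^2(t)}\right]^2 d\mu_j(t)$

(4.10)  $\le (p - 1) \sup_{h \in H} \sum_{j \ne k} \int_0^1 h_k^2(t) \frac{[E Y_{1j}(t) Y_{1k}(t)]^2}{E Y_{1j}^2(t)\, E Y_{1k}^2(t)}\,d\mu_k(t)$

$\le (p - 1)^2 \delta,$

where

$\delta = \sup_{t \in [0,1],\, j \ne k} \frac{[E Y_{1j}(t) Y_{1k}(t)]^2}{E Y_{1j}^2(t)\, E Y_{1k}^2(t)}.$


By (A4) there exists a constant $c$ such that $(p - 1)^2 \delta < c < 1$. Then, by (4.8) and
(4.10),


$P\{\|I^{(n)} - a^{(n)}\| \le c\} \to 1$, as $n \to \infty$.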




But if $V$ is any $pd_n \times pd_n$ matrix such that $\|I^{(n)} - V\| < 1$ then $V$ is invertible
and $\|V^{-1}\| \le (1 - \|I^{(n)} - V\|)^{-1}$. It follows that


$P(a^{(n)} \text{ is invertible}) \to 1$, as $n \to \infty$

and


$P(\|a^{(n)-1}\| \le (1 - c)^{-1}) \to 1$, as $n \to \infty$. □


Acknowledgments. I would like to thank Muhammad Habib for several 
helpful conversations during the preparation of this paper. Thanks also to a 
referee for detailed comments which led to a significant improvement in the
paper.


REFERENCES


AALEN, O. O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6
701-726.

AALEN, O. O. (1980). A model for nonparametric regression analysis of counting processes. In Lecture
Notes in Statistics 2 1-25. Springer, New York.

ANTONIADIS, A. (1985). Parametric estimation for the mean of a Gaussian process by the method of 
sieves. Unpublished. 

ČENCOV, N. N. (1962). Estimation of an unknown distribution density from observations. Soviet
Math. 3 1559-1562. 

Cox, D. R. (1972). Regression models with life tables (with discussion). J. Roy. Statist. Soc. Ser. B 
34 187-220. 

GEMAN, S. and HWANG, C. R. (1982). Nonparametric maximum likelihood estimation by the method 
of sieves. Ann. Statist. 10 401-414. 

GRENANDER, U. (1981). Abstract Inference. Wiley, New York. 

HABIB, M. K. and MCKEAGUE, I. W. (1985). Parameter estimation for nonstationary diffusion
models of neurons. O.N.R. Technical Report 4, Department of Biostatistics, Univ. North 
Carolina, Chapel Hill. 

IBRAGIMOV, I. A. and LINNIK, YU. V. (1971). Independent and Stationary Sequences of Random
Variables. Wolters-Noordhoff, Netherlands.

Karr, A. F. (1983). Maximum likelihood estimation in the multiplicative intensity model. Technical 
Report 46, Center for Stochastic Processes, Department of Statistics, Univ. North Carolina, 
Chapel Hill. 

LIPTSER, R. S. and SHIRYAYEV, A. N. (1978). Statistics of Random Processes 1. Springer, New York.

MEYER, P. A. (1976). Un cours sur les intégrales stochastiques. Lecture Notes in Math. 511 245-400.
Springer, Berlin. 

NGUYEN, H. T. and PHAM, T. D. (1982). Identification of nonstationary diffusion model by the
method of sieves. SIAM J. Control Optim. 20 603-611. 

PRENTICE, R. L. and SELF, S. G. (1983). Asymptotic distribution theory for Cox-type regression 
models with general relative risk form. Ann. Statist. 11 804-813. 

RAMLAU-HANSEN, H. (1983). Smoothing counting process intensities by means of kernel functions. 
Ann. Statist. 11 453-466. 


DEPARTMENT OF STATISTICS 
FLORIDA STATE UNIVERSITY 
TALLAHASSEE, FLORIDA 32306 


The Annals of Statistics
1986, Vol. 14, No. 2, 590-606


THE DIMENSIONALITY REDUCTION PRINCIPLE
FOR GENERALIZED ADDITIVE MODELS¹


By CHARLES J. STONE
University of California, Berkeley 


Let $(X, Y)$ be a pair of random variables such that $X = (X_1, \ldots, X_J)$
ranges over $C = [0,1]^J$. The conditional distribution of $Y$ given $X = x$ is
assumed to belong to a suitable exponential family having parameter $\eta \in \mathbb{R}$.
Let $\eta = f(x)$ denote the dependence of $\eta$ on $x$. Let $f^*$ denote the additive
approximation to $f$ having the maximum possible expected log-likelihood
under the model. Maximum likelihood is used to fit an additive spline
estimate of $f^*$ based on a random sample of size $n$ from the distribution of
$(X, Y)$. Under suitable conditions such an estimate can be constructed which
achieves the same (optimal) rate of convergence for general $J$ as for $J = 1$.


1. Introduction. In Stone (1985) a variety of parametric, nonparametric, 
and semiparametric statistical models involving an unknown function f were 
discussed with an emphasis on the flexibility, dimensionality, and interpretability 
of the various models. Also, a heuristic dimensionality reduction principle was 
informally introduced. 

Consider, in particular, a pair $(X, Y)$ of random variables, where $X =
(X_1, \ldots, X_J) \in \mathbb{R}^J$ and $Y \in \mathbb{R}$; here $Y$ is called a response variable and
$X_1, \ldots, X_J$ are referred to as predictors. Let $f$ be a function such that $f(x)$ is a
specific attribute of the conditional distribution of Y given X = x; f is called the 
response function. Let f* be the “best” additive approximation to f. If f itself is 
additive, then f* = f. But even if f* differs somewhat from f, f* may be useful 
in practice especially because of its greater interpretability. 

Consider additive estimates of f* based on a random sample of size n from the 
distribution of (X, Y). According to the dimensionality reduction principle, under 
suitable smoothness conditions on f* and appropriate mild auxiliary conditions 
on the distribution of (X,Y), the optimal rate of convergence for general J 
should be the same as that for J = 1. In the paper cited above a precise result to 
this effect was obtained when $f$ is the regression function of $Y$ on $X$. Here an
analogous result will be obtained in a setup that includes logistic regression as a 
special case. 

The setup involves an exponential family of distributions of the form
$e^{b_1(\eta) y + b_2(\eta)}\,\nu(dy)$ subject to some restrictions which will be described in Section 2.
The mean $\mu$ of the distribution is given by $\mu = b_3(\eta) = -b_2'(\eta)/b_1'(\eta)$; correspondingly
$\eta = b_3^{-1}(\mu)$, the function $b_3^{-1}$ being called the link function.


Received December 1984; revised August 1985. 

¹ This research was supported in part by National Science Foundation Grant MCS83-01257.

AMS 1980 subject classifications. Primary 62G20, secondary 62G05. 

Key words and phrases. Exponential family, nonparametric model, additivity, spline, maximum 
quasi likelihood estimate, rate of convergence. 




Consider now a model for the joint distribution of $(X, Y)$ in which $X \in C =
[0,1]^J$ and the conditional distribution of $Y$ given $X = x$ belongs to the above
exponential family with $\eta = f(x)$; correspondingly $E(Y \mid X = x) = b_3(f(x))$, $x \in
C$. This model is called an exponential response model in accordance with
terminology introduced by Haberman (1977). The expected log-likelihood for the
model is given by


$\Lambda(a) = E[b_1(a(X)) Y + b_2(a(X))] = E[b_1(a(X))\, b_3(f(X)) + b_2(a(X))].$


If $f$ is linear, the model is called a generalized linear model [see Nelder and
Wedderburn (1972), McCullagh and Nelder (1983), and Dobson (1983)]. If $f$ is
additive, it is called a generalized additive model in accordance with terminology 
introduced by Hastie and Tibshirani (1984). 

Let the assumption that the conditional distribution of $Y$ given $X = x$ belong
to the exponential family be replaced by the weaker assumption that
$E(Y \mid X = x) = b_3(f(x))$ for $x \in C$. Let $f^*$ be the best additive approximation to
$f$; that is, the additive function that maximizes $\Lambda(\cdot)$. The purpose of this paper is
to verify that under suitable conditions, the dimensionality reduction principle 
holds for estimation of f*; and that the optimal rate of convergence can be 
achieved by a natural and practicable estimate involving the use of maximum 
likelihood to fit an additive spline. 


2. Statement of results. Consider an exponential family of the form
$e^{b_1(\eta) y + b_2(\eta)}\,\nu(dy)$, where the parameter $\eta$ ranges over $\mathbb{R}$. Here $\nu$ is a nonzero
measure on $\mathbb{R}$ which is not concentrated at a single point and

$\int e^{b_1(\eta) y + b_2(\eta)}\,\nu(dy) = 1$ for $-\infty < \eta < \infty$.


The function $b_1$ is required to be twice continuously differentiable and its first
derivative $b_1'$ is required to be strictly positive on $\mathbb{R}$. Consequently, $b_1$ is strictly
increasing and $b_2$ is twice continuously differentiable on $\mathbb{R}$. The mean $\mu$ of the
distribution is given by $\mu = b_3(\eta) = -b_2'(\eta)/b_1'(\eta)$. The function $b_3$ is continuously
differentiable and $b_3'$ is strictly positive on $\mathbb{R}$; so $b_3$ is strictly increasing on
$\mathbb{R}$. Given any positive constant $\eta_0$, there are positive constants $t_0$ and $M$ such
that

feterimy bsp (dy) < M for |n| < nand |t| < to. 


Finally, it is required that there be a subinterval S of R such that »v is 
concentrated on S (1.e., »(S°) = 0) and 


(1) bi(n)y + bf(n) <0 forn €R and yes. 
[If bY = 0, then (1) holds automatically.] It follows from (1) that 
(2) bi(n)ba(no) + bz(n) <0 forn, m ER. 


Although (1) seems quite restrictive, it and the other requirements mentioned
above are satisfied in most of the familiar exponential families, including the
following five examples [see also Wedderburn (1976)].


EXAMPLE 1 (Normal). The normal distribution with mean $\mu$ and fixed variance
$\sigma^2$ is of the required form with $b_1(\eta) = \eta/\sigma^2$, $b_2(\eta) = -\eta^2/(2\sigma^2)$, and
$S = \mathbb{R}$. Here $b_3(\eta) = \eta$ and $b_3^{-1}(\mu) = \mu$.


EXAMPLE 2 (Binomial-logit). The binomial distribution with parameters
$n_0$ and $\pi$, with $0 < \pi < 1$, is of the required form with $b_1(\eta) = \eta$, $b_2(\eta) =
-n_0\log(1 + e^{\eta})$, and $S = [0, n_0]$. Here $b_3(\eta) = n_0 e^{\eta}/(1 + e^{\eta})$ and $b_3^{-1}(\mu) =
\log(\mu/(n_0 - \mu)) = \operatorname{logit}(\mu/n_0) = \operatorname{logit}(\pi)$.


EXAMPLE 3 (Binomial-probit). The binomial distribution from Example 2
can also be put in the required form with $\mu = b_3(\eta) = n_0\Phi(\eta)$ and $\eta = b_3^{-1}(\mu) =
\Phi^{-1}(\mu/n_0) = \Phi^{-1}(\pi)$, $\Phi$ being the standard normal distribution function. To do
so, take $b_1(\eta) = \log(\Phi(\eta)/(1 - \Phi(\eta)))$, $b_2(\eta) = n_0\log(1 - \Phi(\eta))$, and $S = [0, n_0]$.


EXAMPLE 4 (Poisson). The Poisson distribution with mean $\mu > 0$ is of the
required form with $b_1(\eta) = \eta$, $b_2(\eta) = -e^{\eta}$, and $S = [0, \infty)$. Here $\mu = b_3(\eta) = e^{\eta}$
and $\eta = b_3^{-1}(\mu) = \log(\mu)$.


EXAMPLE 5 (Gamma). The gamma distribution with parameters $\alpha$ (fixed)
and $\lambda$ is of the required form with $b_1(\eta) = -e^{-\eta}$, $b_2(\eta) = -\alpha\eta$, and $S = (0, \infty)$.
Here $\mu = b_3(\eta) = \alpha e^{\eta}$ and $\eta = b_3^{-1}(\mu) = \log(\mu/\alpha)$.


Geometric and other negative binomial distributions can also be put in the 
required form. 
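The five examples can be checked numerically. The sketch below is only an illustration: the values of $n_0$, $\sigma^2$, $\alpha$ and the grids of $\eta$ and $y$ are arbitrary assumptions, and derivatives are approximated by finite differences. It tests the identity $b_3(\eta) = -b_2'(\eta)/b_1'(\eta)$ and the sign condition (1) at a few points of $S$.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Illustrative parameter values (assumptions, not taken from the paper).
N0, SIGMA2, ALPHA = 5, 2.0, 3.0

# name -> (b1, b2, b3, a few sample points y in S)
FAMILIES = {
    "normal": (lambda e: e / SIGMA2,
               lambda e: -e * e / (2.0 * SIGMA2),
               lambda e: e,
               [-3.0, 0.0, 3.0]),
    "binomial-logit": (lambda e: e,
                       lambda e: -N0 * math.log(1.0 + math.exp(e)),
                       lambda e: N0 * math.exp(e) / (1.0 + math.exp(e)),
                       [0, 2, N0]),
    "binomial-probit": (lambda e: math.log(Phi(e) / (1.0 - Phi(e))),
                        lambda e: N0 * math.log(1.0 - Phi(e)),
                        lambda e: N0 * Phi(e),
                        [0, 2, N0]),
    "poisson": (lambda e: e,
                lambda e: -math.exp(e),
                lambda e: math.exp(e),
                [0, 1, 10]),
    "gamma": (lambda e: -math.exp(-e),
              lambda e: -ALPHA * e,
              lambda e: ALPHA * math.exp(e),
              [0.1, 1.0, 10.0]),
}

def d1(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def check(name, etas=(-1.5, -0.5, 0.0, 0.7, 1.5)):
    """Return (mean identity holds, condition (1) holds) on the test grid."""
    b1, b2, b3, ys = FAMILIES[name]
    mean_ok = all(
        abs(b3(e) + d1(b2, e) / d1(b1, e)) < 1e-5 * max(1.0, abs(b3(e)))
        for e in etas
    )
    cond1_ok = all(d2(b1, e) * y + d2(b2, e) < 0.0 for e in etas for y in ys)
    return mean_ok, cond1_ok

results = {name: check(name) for name in FAMILIES}
```

A grid check of this kind is not a proof of (1); it only confirms that the stated $b_1$, $b_2$, $b_3$ are mutually consistent at the sampled points.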

Let $(X, Y)$ be a pair of random variables, where $Y \in \mathbb{R}$ and $X = (X_1, \ldots, X_J)$
ranges over $C = [0,1]^J$.


CONDITION 1. The distribution of X is absolutely continuous and its density 
g is bounded away from zero and infinity on C. 


The conditional distribution of Y given X = x is not required to belong to the 
exponential family described above, but the following conditions are required to 
hold. 


CONDITION 2. $P(Y \in S) = 1$.

CONDITION 3. $E(Y \mid X = x) = b_3(f(x))$, $x \in C$, where $f$ is bounded on $C$.


CONDITION 4. There are positive constants $t_0$ and $M$ such that

$$E(e^{tY} \mid X = x) \le M \quad \text{for } |t| \le t_0 \text{ and } x \in C.$$




Let $\mathcal{A}$ denote the collection of additive functions $a$ on $C$ such that $E|a(X)| < \infty$.
Each $a \in \mathcal{A}$ can be represented in the form

$$a(x_1, \ldots, x_J) = a_0 + \sum_{j=1}^{J} a_j(x_j), \tag{3}$$

where $Ea_j(X_j) = 0$ for $1 \le j \le J$. Clearly $a_0 = Ea(X)$. It follows from Lemma 1
of Stone (1985) that under Condition 1 the functional components $a_j$, $1 \le j \le J$,
are essentially uniquely determined (i.e., uniquely determined up to sets of
Lebesgue measure zero); and there is at most one continuous version of each such
function. If $a$ is essentially bounded (i.e., bounded except on a set of Lebesgue
measure zero), then so are its functional components.

Set

$$\Lambda(a) = \int \big[b_1(a(x))\,b_3(f(x)) + b_2(a(x))\big]\,g(x)\,dx.$$

It follows from Lemma 1 in Section 3 that $-\infty \le \Lambda(a) < \infty$ for $a \in \mathcal{A}$. The
following theorem will be proven in Section 3. Here almost everywhere means
except on a set of Lebesgue measure zero.


THEOREM 1. Suppose that Conditions 1 and 3 hold. Then there is a function
$f^* \in \mathcal{A}$ such that $\Lambda(f^*) = \max_{a \in \mathcal{A}} \Lambda(a)$; $f^*$ is essentially uniquely determined
and essentially bounded. If $f \in \mathcal{A}$, then $f^* = f$ almost everywhere.


The function $f^*$ from Theorem 1 is referred to as the best additive approximation
to the response function $f$; it can be represented in the form

$$f^*(x_1, \ldots, x_J) = f_0^* + \sum_{j=1}^{J} f_j^*(x_j),$$

where $Ef_j^*(X_j) = 0$ for $1 \le j \le J$.

Let $q$ be a nonnegative integer, let $\alpha \in (0,1]$ be such that $p = q + \alpha > 0.5$,
and let $M_0 \in (0, \infty)$. Let $\mathcal{H}$ denote the collection of functions $h$ on $[0,1]$ whose
$q$th derivative, $h^{(q)}$, exists and satisfies the Hölder condition with exponent $\alpha$:

$$|h^{(q)}(t') - h^{(q)}(t)| \le M_0|t' - t|^{\alpha} \quad \text{for } 0 \le t, t' \le 1.$$

CONDITION 5. $f_j^* \in \mathcal{H}$ for $1 \le j \le J$.
Let $N$ denote a positive integer and let $I_{N\nu}$, $1 \le \nu \le N$, denote the subintervals
of $[0,1]$ defined by $I_{N\nu} = [(\nu - 1)/N, \nu/N)$ for $1 \le \nu < N$ and $I_{NN} =
[1 - N^{-1}, 1]$. Let $q'$ and $q''$ be integers such that $q' \ge q$ and $q' > q'' \ge -1$. Let
$\mathcal{S}_N$ denote the collection of functions $s$ on $[0,1]$ such that

(i) the restriction of $s$ to $I_{N\nu}$ is a polynomial of degree $q'$ (or less) for $1 \le \nu \le N$;
and, if $q'' \ge 0$,
(ii) $s$ is $q''$ times continuously differentiable on $[0,1]$.




A function satisfying (i) is called a piecewise polynomial; if q’ = 0, it is 
piecewise constant. A function satisfying (i) and (ii) is called a spline. Typically, 
splines are considered with q” = q’ — 1 and then called linear, quadratic or cubic 
splines according as q’ = 1, 2, or 3. The N — 1 points 1/N,...,(N — 1)/N are 
called interior knots.
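A standard basis for such spline spaces, with $q'' = q' - 1$, is the B-spline basis, which the paper uses in Section 4. The following is a minimal pure-Python sketch of the Cox-de Boor recursion on the equally spaced knots $0, 1/N, \ldots, 1$. The function name `bspline_basis` and the choice of repeated boundary knots are assumptions made for illustration, not the paper's implementation.

```python
def bspline_basis(x, degree, n_intervals):
    """Values at x in [0, 1] of all B-splines of the given degree on the
    equally spaced knots 0, 1/N, ..., 1 (boundary knots repeated `degree`
    extra times). Returns a list of N + degree values."""
    N, k = n_intervals, degree
    knots = [0.0] * k + [i / N for i in range(N + 1)] + [1.0] * k
    n0 = len(knots) - 1

    # Degree 0: indicators of [knots[i], knots[i+1]); the last nonempty
    # interval is closed on the right, matching I_NN = [1 - 1/N, 1].
    vals = []
    for i in range(n0):
        lo, hi = knots[i], knots[i + 1]
        inside = (lo <= x < hi) or (x == 1.0 and hi == 1.0 and lo < hi)
        vals.append(1.0 if inside else 0.0)

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, k + 1):
        new = []
        for i in range(n0 - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * vals[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * vals[i + 1]
                     if right_den > 0 else 0.0)
            new.append(left + right)
        vals = new
    return vals

# Cubic splines (q' = 3, q'' = 2) with N = 4, i.e. three interior knots.
cubic_at_half = bspline_basis(0.5, 3, 4)
```

With `degree=0` the same function returns the piecewise-constant indicators of the intervals $I_{N\nu}$. In either case the basis functions are nonnegative and sum to one at every point of $[0,1]$.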

Let $(X_1, Y_1), (X_2, Y_2), \ldots$ denote independent pairs, each having the same
distribution as $(X, Y)$ and write $X_i$ as $(X_{i1}, \ldots, X_{iJ})$. Consider the random
sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of size $n$. Let $N_n$ denote a positive integer and let
$\mathcal{A}_n$ denote the collection of functions $a$ on $C$ of the additive form (3) where the
functional components $a_j$, $1 \le j \le J$, are such that $a_j \in \mathcal{S}_{N_n}$ and $\sum_{i=1}^{n} a_j(X_{ij}) = 0$.
A function in $\mathcal{A}_n$ is called an additive spline.

Let $l_n(a) = \sum_{i=1}^{n} [b_1(a(X_i))Y_i + b_2(a(X_i))]$, $a \in \mathcal{A}_n$, denote the log-likelihood
function corresponding to the random sample of size $n$. If $\hat f_n \in \mathcal{A}_n$ and $l_n(\hat f_n) =
\max_{a \in \mathcal{A}_n} l_n(a)$, then $\hat f_n$ is called the maximum likelihood additive spline estimate
of $f^*$. It follows from Lemma 14 in Section 4 that under Condition 1 and
the condition on $N_n$ in Theorem 2, except on an event whose probability
tends to zero with $n$, $\hat f_n$ exists and has a unique representation in the form
$\hat f_n(x_1, \ldots, x_J) = \hat f_{n0} + \sum_{j=1}^{J} \hat f_{nj}(x_j)$ with $\sum_{i=1}^{n} \hat f_{nj}(X_{ij}) = 0$ for $1 \le j \le J$. The functions
$\hat f_{nj}$, $1 \le j \le J$, are referred to as the component functions of $\hat f_n$ and $\hat f_{n0}$ is
referred to as the constant term.
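To make the log-likelihood $l_n(a)$ concrete, here is a small sketch that evaluates it for an additive function under the Poisson family of Example 4. The data and the helper names (`additive`, `log_likelihood`, `ll_at`) are hypothetical. The check at the end uses the elementary fact that, among constant functions, the Poisson log-likelihood is maximized at the logarithm of the sample mean.

```python
import math

def additive(a0, components):
    """Return a(x) = a0 + sum_j a_j(x_j) for x = (x_1, ..., x_J)."""
    def a(x):
        return a0 + sum(a_j(x_j) for a_j, x_j in zip(components, x))
    return a

def log_likelihood(a, b1, b2, xs, ys):
    """l_n(a) = sum_i [ b1(a(X_i)) * Y_i + b2(a(X_i)) ]."""
    return sum(b1(a(x)) * y + b2(a(x)) for x, y in zip(xs, ys))

# Poisson family: b1(eta) = eta, b2(eta) = -exp(eta).
poisson_b1 = lambda eta: eta
poisson_b2 = lambda eta: -math.exp(eta)

# Made-up sample with J = 2 predictors in [0, 1]^2.
xs = [(0.1, 0.9), (0.4, 0.2), (0.7, 0.5), (0.9, 0.3)]
ys = [1, 0, 3, 4]

zero = lambda t: 0.0  # trivial component function

def ll_at(c):
    """Log-likelihood of the constant additive function a(x) = c."""
    return log_likelihood(additive(c, [zero, zero]), poisson_b1, poisson_b2, xs, ys)

c_hat = math.log(sum(ys) / len(ys))  # log of the sample mean
```

Replacing `zero` by spline components gives the objective that the maximum likelihood additive spline estimate maximizes, subject to the centering constraints on the components.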

The rate of convergence of $\hat f_n$ to $f^*$ will now be determined. To this end, given
positive numbers $a_n$ and $b_n$ for $n \ge 1$, let $a_n \sim b_n$ mean that $a_n/b_n$ is bounded
away from zero and infinity. Given random variables $Z_n$, $n \ge 1$, let $Z_n = O_{pr}(b_n)$
mean that the random variables $b_n^{-1}Z_n$, $n \ge 1$, are bounded in probability or,
equivalently, that

$$\lim_{c \to \infty} \limsup_{n} \Pr(|Z_n| > cb_n) = 0;$$

also let $Z_n = o_{pr}(b_n)$ mean that the random variables $b_n^{-1}Z_n$, $n \ge 1$, converge to zero in
probability or, equivalently, that

$$\lim_{n} \Pr(|Z_n| > cb_n) = 0 \quad \text{for all } c > 0.$$

Let $\|\phi\|$ denote the $L^2$ norm of a function $\phi$ on $C$, defined by $\|\phi\|^2 =
E\phi^2(X) = \int_C \phi^2(x)g(x)\,dx$. For $1 \le j \le J$ let $\|h\|_j$ denote the $L^2$ norm of a
function $h$ on $[0,1]$, defined by $\|h\|_j^2 = Eh^2(X_j) = \int_0^1 h^2(x_j)g_j(x_j)\,dx_j$. Here $g_j$ is
the marginal density of $X_j$. It follows from Condition 1 that $g_j$ is bounded away
from zero and infinity on $[0,1]$.

Recall that $J$ is the number of predictors; $f^*$ is the best additive approximation
to the true response function $f$; $p$ is the assumed measure of smoothness of
$f^*$ (roughly speaking, the degree of a derivative of $f^*$ that is assumed to be
bounded); $n$ is the sample size; $N_n - 1$ is the number of interior knots; $\hat f_n$ is the
maximum likelihood additive spline estimator of $f^*$; $\hat f_{n1}, \ldots, \hat f_{nJ}$ are the component
functions of $\hat f_n$; and $\hat f_{n0}$ is its constant term. Set $\gamma = 1/(2p + 1)$ and
$r = p\gamma$. Given a nonnegative integer $m$, set $r_m = (p - m)\gamma$. The proof of the next
result will be given in Section 4.




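THEOREM 2. Suppose that Conditions 1-5 hold and that $N_n \sim n^{\gamma}$. Then

$$(\hat f_{n0} - f_0^*)^2 = O_{pr}(n^{-2r}),$$

$$\big\|\hat f_{nj}^{(m)} - (f_j^*)^{(m)}\big\|_j^2 = O_{pr}(n^{-2r_m}) \quad \text{for } 0 \le m \le q \text{ and } 1 \le j \le J,$$

and

$$\|\hat f_n - f^*\|^2 = O_{pr}(n^{-2r}).$$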


Theorem 2 lends theoretical support to the use of generalized additive models 
and to maximum likelihood additive spline estimators. It shows that the same 
rates of convergence can be achieved when there are multiple predictors as when 
there is only one predictor. It is clear from the results in Stone (1982) for J = 1 
that these rates (except possibly that for the constant term) are optimal. 
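The exponents in Theorem 2 are simple functions of the smoothness $p$. As a rough illustration, the sketch below computes $\gamma$, $r$, the derivative exponents $r_m$, and a knot count of order $n^{\gamma}$. The function name `convergence_rates` and the choice of rounding $n^{\gamma}$ to the nearest integer are assumptions; the theorem only requires $N_n \sim n^{\gamma}$.

```python
def convergence_rates(p, q, n):
    """Exponents from Theorem 2 for smoothness p = q + alpha and sample size n.

    Returns (gamma, r, [r_0, ..., r_q], N_n), where squared errors are of
    order n**(-2 * r) and n**(-2 * r_m), and N_n is one admissible choice
    of the number of knot intervals (any N_n of order n**gamma would do).
    """
    gamma = 1.0 / (2.0 * p + 1.0)
    r = p * gamma
    r_m = [(p - m) * gamma for m in range(q + 1)]
    n_intervals = max(1, round(n ** gamma))
    return gamma, r, r_m, n_intervals

# Example: q = 1, alpha = 1, so p = 2 (first derivative Lipschitz).
gamma, r, r_m, n_intervals = convergence_rates(p=2.0, q=1, n=100_000)
```

Note that the number of predictors $J$ does not enter any of these exponents, which is the content of the dimensionality reduction principle.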

Burman (1985) has recently introduced a selection rule for the parameter N, of 
the maximum likelihood additive spline estimator of f*; it depends on the sample 
data but not on any assumed measure of smoothness of f*. According to his main 
result, which complements Theorem 2, this selection rule is asymptotically 
optimal in a natural sense that also does not depend on any assumed measure of 
smoothness of f*. 

Previously, Hastie and Tibshirani (1984) introduced a procedure for fitting 
generalized additive models that involves “running line smoothers” and a “local 
scoring method” instead of splines and the usual maximum likelihood method. 
Through a number of examples involving real data, they demonstrated the 
usefulness of their procedure in uncovering nonlinear predictor effects. In this 
connection, see also Hastie (1984). 

Cha-Yong Koo and I have recently developed a tentative procedure for fitting 
generalized additive models based on cubic splines and maximum likelihood; it 
allows for subjective decisions about the number of knots and their placement 
and about restrictions on the various component functions that they be linear in 
one or both tails. The procedure has been implemented numerically using 
B-splines [see de Boor (1978) and Section 4] and GLIM [see Baker and Nelder 
(1978)]. We have applied the procedure to the real data sets treated by Hastie 
and Tibshirani and constructed plots of point estimates of component functions, 
plots of confidence interval estimates of these functions, and residual plots (our 
plots of the point estimates are smoother than but otherwise very similar to 
theirs). After examining these plots, we find the procedure to be a promising tool 
for the analysis of data involving a response variable and one or more predictors. 
Some of this work is reported in Stone and Koo (1986). 


3. Proof of Theorem 1. Throughout this section it is assumed that Condition 1
holds and that $f$ is bounded.


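LEMMA 1. Given $T > 0$ there exist $\varepsilon > 0$ and $A > 0$ such that

$$b_1(\eta)\,b_3(\eta_0) + b_2(\eta) \le A - \varepsilon|\eta| \quad \text{for } |\eta_0| \le T \text{ and } \eta \in \mathbb{R},$$

$$b_1(\eta)\,b_3(\eta_0) + b_2(\eta) \le A - \varepsilon|b_1(\eta)| \quad \text{for } |\eta_0| \le T \text{ and } \eta \in \mathbb{R},$$

and

$$b_1(\eta)\,b_3(\eta_1) + b_2(\eta) \ge (1 + A)\big(b_1(\eta)\,b_3(\eta_0) + b_2(\eta)\big) - A^2 \quad \text{for } |\eta_0| \le T,\ |\eta_1| \le T, \text{ and } \eta \in \mathbb{R}.$$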


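PROOF. Set $\Psi_{\eta_0}(\eta) = b_1(\eta)\,b_3(\eta_0) + b_2(\eta)$. Then $\Psi_{\eta_0}'(\eta_0) = 0$ and $\Psi_{\eta_0}''(\eta) =
b_1''(\eta)\,b_3(\eta_0) + b_2''(\eta) < 0$ by (2). Since $b_1''$, $b_2''$, and $b_3$ are continuous, there is a
$\delta > 0$ such that $\Psi_{\eta_0}''(\eta) \le -\delta$ for $|\eta_0| \le T$ and $|\eta| \le 2T$. Consequently, $\Psi_{\eta_0}'(\eta) \le
\Psi_{\eta_0}'(2T) \le -\delta T$ for $\eta \ge 2T$ and $\Psi_{\eta_0}'(\eta) \ge \delta T$ for $\eta \le -2T$. Therefore $\Psi_{\eta_0}(\eta) \le
\Psi_{\eta_0}(2T) - \delta T(\eta - 2T)$ for $\eta \ge 2T$ and $\Psi_{\eta_0}(\eta) \le \Psi_{\eta_0}(-2T) + \delta T(\eta + 2T)$ for
$\eta \le -2T$. The first result follows easily from these two inequalities. The second
result follows from the first result, since $b_3'$ is continuous and strictly positive on
$\mathbb{R}$. (Replace $\eta_0$ by $\eta_0 \pm 1$ in the first result.) The third result follows from the
second result.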


Let $T$ now be an upper bound to $|f|$ on $C$. It follows from Lemma 1 that

$$\Lambda(a) \le A - \varepsilon\int |a|\,g, \quad a \in \mathcal{A}. \tag{4}$$


LEMMA 2. Let $Z$ be a random variable having mean zero. Then $E|Z| \le
2E|u + Z|$ for all $u \in \mathbb{R}$.


PROOF. Let $Z^+$ ($Z^-$) denote the maximum of $Z$ ($-Z$) and 0. Then $Z = Z^+ -
Z^-$ and $|Z| = Z^+ + Z^-$, so $EZ^+ = EZ^- = E|Z|/2$. If $u \ge 0$, then $|u + Z| \ge Z^+$
and hence $E|u + Z| \ge EZ^+ = E|Z|/2$. Similarly if $u < 0$, then $E|u + Z| \ge
E|Z|/2$. This yields the desired result.


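Let $v$ and $V$ denote positive constants such that $v \le g \le V$ on $C$. Then
$v \le g_j \le V$ on $[0,1]$ for $1 \le j \le J$.


LEMMA 3. Let $a \in \mathcal{A}$. Then

$$\int |a_j| \le \frac{2V}{v^2\varepsilon}\,(A - \Lambda(a)) \quad \text{for } 1 \le j \le J.$$


PROOF. According to (4), $\int |a|\,g \le (A - \Lambda(a))/\varepsilon$. Let $1 \le j \le J$. By the
definition of $\mathcal{A}$, there is a $u \in \mathbb{R}$ such that

$$\int |u + a_j| \le \int |a| \le \frac{1}{v}\int |a|\,g \le \frac{A - \Lambda(a)}{v\varepsilon}.$$

Consequently, by Lemma 2,

$$\int |a_j| \le \frac{1}{v}\int |a_j|\,g_j \le \frac{2}{v}\int |u + a_j|\,g_j \le \frac{2V}{v}\int |u + a_j| \le \frac{2V}{v^2\varepsilon}\,(A - \Lambda(a))$$

as desired.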




Let $\|\phi\|_{\infty}$ denote the $L^{\infty}$ norm (supremum) of $\phi$.

LEMMA 4. Let $M_1$ be a real constant. Then there is a positive constant $M_2$
such that the following holds: If $a \in \mathcal{A}$ and $\Lambda(a) \ge M_1$, there is an $\bar a \in \mathcal{A}$ such
that $\Lambda(\bar a) \ge \Lambda(a)$ and $\|\bar a\|_{\infty} \le M_2$.

PROOF. In the following argument, $M_3, M_4, \ldots$ denote unspecified positive
constants which can be defined in terms of $M_1$, $v$, $V$, $A$, $\varepsilon$, and $J$.
Choose $a \in \mathcal{A}$ with $\Lambda(a) \ge M_1$. It follows from Lemma 3 that


J 


According to the definition of A(a), there is an x, € [0,1] such that if u = a, + 
a,(x,), then 


Bi 
(5) flo u + Eal) os f(X,,.-.,%y)) + by 


ML toa s) dx. acta dx, = A(a). 
Consequently, by the first conclusion of Lemma 1 


J 


and hence |ū| < M,. It follows from (5) that 


flod + Zalos f(X,,...,%,)) + b 


Xg(X,,---, Ly) dr, May dx, = — M}. 


According to the first conclusion of Lemma 1, the quantity in brackets in (6) is 
nonpositive. Thus by Condition 1, 


J J 
flea + dia, (x)| bal f(%1,-..5%4)) + ba + La) — 4| 
<e(x) dx, --- dx, > —M, 
and hence, by the third conclusion of Lemma 1, 
J J 
fiola + Sax] f(x)) + ba + ax,)]| 


Xe(x) dx, oe te dx; = ~ Mg. 
Observe that if ja, + a,(x,)| > Myo, then 


[[bs(az))b5( F(2)) + ba(a(x))] a(x) drz «++ dy < -M 





$a) fa) dxo +- dx < Ms. 





J 
u + a,x, 





A-e 





J 
a + La (z) etia) dx, ++: dx; > A(a) 








J 
u + Las) — a 





(6) 







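Define $\tilde a_1$ on $[0,1]$ by $\tilde a_1(x_1) = a_0 + a_1(x_1)$ if $|a_0 + a_1(x_1)| \le M_{10}$ and $\tilde a_1(x_1) = \bar u$
otherwise. Write $\tilde a_1(x_1) = \bar a_0 + \bar a_1(x_1)$, where $\int \bar a_1 g_1 = 0$. Then $|\bar a_0 + \bar a_1(x_1)| \le
M_{11}$ for $x_1 \in [0,1]$ and hence

$$|\bar a_0| \le M_{11} \tag{7}$$

and $\|\bar a_1\|_{\infty} \le M_{12}$. Also, if $\bar a$ is defined by

$$\bar a(x_1, \ldots, x_J) = \bar a_0 + \bar a_1(x_1) + \sum_{j=2}^{J} a_j(x_j),$$

then

$$\Lambda(\bar a) \ge \Lambda(a). \tag{8}$$

By similarly modifying $a_j$, $2 \le j \le J$, we obtain $\bar a \in \mathcal{A}$ where (7) and (8) hold as
well as

$$\|\bar a_j\|_{\infty} \le M_{12} \quad \text{for } 1 \le j \le J. \tag{9}$$

By (7) and (9), $\|\bar a\|_{\infty} \le M_2$. This completes the proof of the lemma.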


LEMMA 5. Given a positive constant $M_1$ there are positive constants $M_2$ and
$M_3$ such that if $a_j \in \mathcal{A}$ and $\|a_j\|_{\infty} \le M_1$ for $j = 1, 2$, then

$$-M_2\|a_1 - a_2\|^2 \le \frac{d^2}{dt^2}\Lambda(ta_1 + (1 - t)a_2) \le -M_3\|a_1 - a_2\|^2 \quad \text{for } 0 \le t \le 1.$$


PROOF. Since

$$\frac{d^2}{dt^2}\Lambda(ta_1 + (1 - t)a_2) = \int (a_1 - a_2)^2\big[b_1''(ta_1 + (1 - t)a_2)\,b_3(f) + b_2''(ta_1 + (1 - t)a_2)\big]\,g,$$

the desired result follows from (2) and continuity.


PROOF OF THEOREM 1. It follows from (4) that the numbers $\Lambda(a)$, $a \in \mathcal{A}$,
are bounded above by $A$. Let $L$ denote the least upper bound of these numbers.
Let $a_k$, $k \ge 1$, denote a sequence of elements of $\mathcal{A}$ such that $\lim_k \Lambda(a_k) = L$. By
Lemma 4 it can be assumed that $\|a_k\|_{\infty} \le M_2$ for $k \ge 1$. It now follows from
Lemma 5 and the definition of $L$ that $\|a_k - a_{k'}\| \to 0$ as $k, k' \to \infty$ and hence
that $\|a_k - f^*\| \to 0$ for some essentially bounded function $f^*$. By Lemma 1 of
Stone (1985), $f^*$ can be chosen to be in $\mathcal{A}$. Clearly $\Lambda(f^*) = L$. Suppose that
$\tilde f \in \mathcal{A}$ and $\Lambda(\tilde f) = L$. It follows by an argument similar to a portion of the proof
of Lemma 4 that $\tilde f$ is essentially bounded and hence from Lemma 5 that
$\|\tilde f - f^*\| = 0$. Thus $f^*$ is essentially uniquely determined. Observe that, for
$\eta_0 \in \mathbb{R}$, the function $\Psi$ on $\mathbb{R}$ defined by $\Psi(\eta) = b_1(\eta)\,b_3(\eta_0) + b_2(\eta)$ has a unique
maximum at $\eta = \eta_0$. The last statement of the theorem is a simple consequence
of this observation.
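The closing observation of the proof, that $\Psi(\eta) = b_1(\eta)\,b_3(\eta_0) + b_2(\eta)$ is maximized exactly at $\eta = \eta_0$, can be illustrated by a grid search. The sketch below does this for the Poisson and binomial-logit families; the grid bounds, the value $n_0 = 5$, and the function names are illustrative assumptions.

```python
import math

def psi(eta, eta0, b1, b2, b3):
    """Psi(eta) = b1(eta) * b3(eta0) + b2(eta)."""
    return b1(eta) * b3(eta0) + b2(eta)

def argmax_on_grid(eta0, b1, b2, b3, lo=-5.0, hi=5.0, steps=1000):
    """Grid point in [lo, hi] at which Psi is largest."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda eta: psi(eta, eta0, b1, b2, b3))

N0 = 5  # illustrative binomial size

poisson = (lambda e: e, lambda e: -math.exp(e), lambda e: math.exp(e))
logit = (lambda e: e,
         lambda e: -N0 * math.log(1.0 + math.exp(e)),
         lambda e: N0 * math.exp(e) / (1.0 + math.exp(e)))

poisson_argmax = argmax_on_grid(0.5, *poisson)
logit_argmax = argmax_on_grid(-1.0, *logit)
```

Because both values of $\eta_0$ lie on the grid, the grid maximizer coincides with $\eta_0$ up to rounding. This is the pointwise reason why $f^* = f$ almost everywhere when $f$ is itself additive.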




4. Proof of Theorem 2. Throughout this section it is assumed that Conditions
1-5 hold and that $N_n \sim n^{\gamma}$.


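LEMMA 6. Let $M_1$ be a positive constant. Then there are positive constants
$M_2$ and $M_3$ such that

$$-M_2\|a - f^*\|^2 \le \Lambda(a) - \Lambda(f^*) \le -M_3\|a - f^*\|^2$$

for all $a \in \mathcal{A}$ such that $\|a\|_{\infty} \le M_1$.


PROOF. Given $a \in \mathcal{A}$ with $\|a\|_{\infty} \le M_1$, set $a^{(t)} = ta + (1 - t)f^*$. Then

$$\frac{d}{dt}\Lambda(a^{(t)})\Big|_{t=0} = 0$$

and hence

$$\Lambda(a) - \Lambda(f^*) = \int_0^1 (1 - t)\,\frac{d^2}{dt^2}\Lambda(a^{(t)})\,dt.$$

Since $\|f^*\|_{\infty} < \infty$, the desired result now follows from Lemma 5.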


LEMMA 7. There is a positive constant $M_4$ such that $\|a\|_{\infty} \le M_4 N_n^{1/2}\|a\|$ for
$n \ge 1$ and $a \in \mathcal{A}_n$.


ProoF. In this proof it can be assumed that fa,g, = 0 for 1 < j < J. Observe 
that 


jal? = fate a + f| Èa a} )| a(x) de. 


By Lemma 1 of Stone (1985) there is a positive constant M,, such that 


poe ) (2) de = Mobs [as ag 


Let 1 <j < J. By Lemma 11 of the same paper there is a positive constant M,, 
such that 


l la (x)| < MaN, j a; g, < M,N, fa? a8) 


for 1 <v < N, and or lalli < M,N, faig, The desired result follows from 
these observations. 


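According to (4), Lemma 5, and the definition of $\mathcal{A}_n$, there is a unique $f_n^* \in \mathcal{A}_n$
such that $\Lambda(f_n^*) = \max_{a \in \mathcal{A}_n} \Lambda(a)$.


LEMMA 8. $\|f_n^* - f^*\|^2 = O(N_n^{-2p})$ and $\|f_n^* - f^*\|_{\infty} = O(N_n^{1/2 - p})$.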


Proor. By Lemma 5 of Stone (1985), a result due to de Boor (1968), and 
Condition 5 there is an f, E £, such that || fa — f*ll, < MoN”; here Mọ is 




some positive constant. Consequently, || f, — f*||? < M3, N7?” Thus by Lemma 6 
there is a positive constant M,, such that 


(10) A( f,) — AC f*) = -M N22? forn > 1. 


Let c denote a large positive constant. Choose a € £, with |la — f*||? = cN; ”. 
Then ija — fi? < Yc + M2,)N_?". Now p > 0.5 so by Lemma 7, for n suffi- 
ciently large, ljal < ||/*||_, + 1 for all such a’s. Thus by Lemma 5 there is a 
positive constant M, such that, for n sufficiently large, 


(11) A(a) — A(f*) s ~MycN>?? for all a € x, with |a — f*|] = cN; ?’. 


Let c be chosen so that M,.c > M,,. It follows from (10) and (11) that, for n 
sufficiently large, 


A(a) < A(f,) forall a Ees, with jja — f*||? = cN>??. 


Therefore, by the concavity of $\Lambda$ as a function of the parameters of $a$,
$\|f_n^* - f^*\|^2 < cN_n^{-2p}$ for $n$ sufficiently large. This verifies the first conclusion of
the lemma. Observe that $\|f_n^* - f_n\|^2 = O(N_n^{-2p})$ and hence by Lemma 7 that
$\|f_n^* - f_n\|_{\infty} = O(N_n^{1/2 - p})$. Consequently, $\|f_n^* - f^*\|_{\infty} = O(N_n^{1/2 - p})$, so the second
conclusion of the lemma is also valid.


The next result follows from Conditions 3 and 4 [see the proof of Lemma 12.26 
in Breiman et al. (1984)]. 


LEMMA 9. There are positive constants M,, and M,, such that 
E [e-em X = x] <1+ Mt? forx © Cand |t| < Mo- 


This lemma will be used to verify the next result. 


LEMMA 10. Givens > 0.5y, c > 0, and e > 0, there is a 5 > 0 such that, for 
n sufficiently large, 


P| eof) AIO) L Afa) = ACES) 


for all a € £, with |a — f*|| < cn™*. 


— 2a 


i i as 
> en <2e7%" 











Proor. Observe that 


n 


l(a) = F [b,(a(X,))¥, + b,(a(X,))] 


i ¥ [os (a(X, ))(¥, - bs ( f(X,))) + ba(a(X,)) + b,(a( X,)) by f(X,))]. 
Consequently, 


1 (ai — L(t) — n(A(a) — AC) = ELB), - ECYIX,)) + BX, 




where 
B(x) = b,(a(x)) — 6,C ft(*)) 
and 
B(x) = bo(a(x)) + b,(a(x))b,( f(x)) - Aa) 
-(b( ft(x)) + b, f*(x))b3( f(%)) — Al pale 
It follows from Lemma 9 that if |tB (x)| < Mio, then 
E [ef BtaxY-BX~201 X = x| < 1 + Mi), t7B?(x) 
and hence 
E |e BiaX¥- 2X20) + Buz] X a x] < (1 i M, ,t?B2(x)) et), 

Thus if ¢7(B?(x) + B3(x)) < Myo, then 

EB et Bia ny BONA S24 S = xl < 1 + ¢B,(x) + Mt?’(B?(x) + B2(x)). 
(Here Mio, M,3,... are unspecified positive constants.) 

Since EB,( X) = 0 it follows that if ¢7(j|B,|!2, + ||B2112) < Mie, then 

Eet BUXKY- ECA) 4+ BEX) < 1 + Mst? | (B? + B2)g< e Mut? (BP + Bg 


Consequently, if ¢7(||B,||2, + || Boll2,) < M,.n?, then 


Ke y (a) < e \} Sí 1 2E/n 


lala) = LC fr) 
n 


Z,(a) =~ ~ (A(a) AAI) 


Set sa = s — 0.5y > 0. Suppose now that a € x, with lla — f*|| < cn™*. Then 
la — fll, < Man by Lemma 7 and hence ||B |2 + IBZ. < Misn~*** and 
{( B? + B3)g < M,,n~**. Therefore 


Eeto < eMiztin 
if |t| < M,gn'***, It follows easily that if ¢/2M,, < M,,n™, then 
Pr(|Z,(a)|= en) < 2e78 “, 
where 6 = e*/4M,.. This completes the proof of the lemma. 


t 2s 


It is a consequence of Conditions 3 and 4 that n` ‘XY, — E(Y,|X,)| is bounded 
in probability and hence that the following result holds. 


LEMMA 11. Given e> 0 and M, > 0, there is a ô > 0 such that, except on 
an event whose probability tends to zero with n, 


l(a) a l(a) 


nS L (Afas) = A(a,))| < en” 


for al a,, a, E £, with lallo < Miz ll@all S Miz, and |a; — aall S ôn”, 




It is convenient to define the “diameter” of a subset B of £, as 
sup{||@, — alle: 41, € B}. 


The next result is an obvious consequence of Lemma 7 and the definition of .,. 
[Set S, = {0,1/q’,2/q’,...,1}. Then there is a C, > 0 such that 


max|P| < C, max|P]| 
[0,1] S; 
for all polynomials P of degree q’.] 


LEMMA 12. Given c> 0, ô> 0, ands > 0.5y there is an M,, > 0 such that 
the following property is valid: {a € #,: |a — f*|| < en~*} can be covered by 
O(e™ Nal") subsets each having diameter at most 6n~?*, 


The next result follows from Lemma 6, with $f^*$ replaced by $f_n^*$ and $\mathcal{A}$
replaced by $\mathcal{A}_n$, and Lemmas 10-12. (Note that $1 - 2s > \gamma$ if $s < p\gamma$.)


LEMMA 13. Let $0.5\gamma < s < p\gamma$ and $c > 0$ be given. Then, except on an event
whose probability tends to zero with $n$, $l_n(a) < l_n(f_n^*)$ for all $a \in \mathcal{A}_n$ such that
$\|a - f_n^*\| = cn^{-s}$.


Let $s$ and $c$ be as in Lemma 13. It follows easily from (1) and Lemma 3 of
Stone (1985) that $l_n$ is strictly concave on

$$\{a \in \mathcal{A}_n : \|a - f_n^*\| \le cn^{-s}\},$$

except on an event whose probability tends to zero with $n$. Thus the next result
follows from Lemma 13.


LEMMA 14. The maximum likelihood additive spline estimate $\hat f_n$ of $f^*$ exists
and is unique, except on an event whose probability tends to zero with $n$.
Moreover, $\|\hat f_n - f_n^*\| = O_{pr}(n^{-s})$ for $s < p\gamma$.

There is a basis B, 1 < 7 < T,, of Sy consisting of B-splines [see Chapter 
IX of de Boor (1978)]. Here T, < M,,N,, where M,,,... are positive constants. 
These functions are nonnegative and sum to one on [0,1]. Also each B,, is zero 
outside an interval J, of length at most M,,N~' whose end points are in 
{0, N7 ',...,1 — N7',1}. If 1 <7,8 < T, and |ô — t| > Mie then J,, and Jpg 
are disjoint. If s = Li*b,B,, € Sy, then 


Ib, < M,,sups* < M,N, f g 

at Jar 
[see (viii) on page 155 of de Boor’s book and Lemma 11 of Stone (1985)]. 
Consequently, 
T, 2 
2 b, Bn 
1 


Ta Ta 
(12) MN ')b? < f < MaN; ‘Lb. 
l l 










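Set $K_n = JT_n$, let $A_{nk}$, $1 \le k \le K_n$, be, in some order, the functions defined
by $A_{nk}(x) = B_{n\tau}(x_j)$, and write $A_{nk}$ as $A_k$ for short. The $A_k$'s span $\mathcal{A}_n$ but they
are not a basis of $\mathcal{A}_n$ since 1 can be represented in $J$ linearly independent ways
as a linear combination of the $A_k$'s. Given a $K_n$ dimensional column vector
$\beta = (\beta_k)$, set $a_{\beta} = \sum_k \beta_k A_k$. Then $\partial a_{\beta}/\partial\beta_k = A_k$. Let $\beta_n^* = (\beta_{nk}^*)$ be such that
$f_n^* = \sum_k \beta_{nk}^* A_k$.

It is convenient to write $l_n(a_{\beta})$ as $l_n(\beta)$. Observe that

$$\frac{\partial l_n}{\partial\beta_k}(\beta) = \sum_{i=1}^{n} A_k(X_i)\big[b_1'(a_{\beta}(X_i))Y_i + b_2'(a_{\beta}(X_i))\big] \tag{13}$$

and

$$\frac{\partial^2 l_n}{\partial\beta_k\,\partial\beta_m}(\beta) = \sum_{i=1}^{n} A_k(X_i)A_m(X_i)\big[b_1''(a_{\beta}(X_i))Y_i + b_2''(a_{\beta}(X_i))\big]. \tag{14}$$

Let $\hat\beta_n = (\hat\beta_{nk})$ be such that $\hat f_n = \sum_k \hat\beta_{nk} A_k$. The maximum likelihood equations
for $\hat\beta_n$ are

$$\frac{\partial l_n}{\partial\beta_k}(\hat\beta_n) = 0 \quad \text{for } 1 \le k \le K_n.$$

In light of Taylor's theorem, these equations can be rewritten as

$$C_n(\hat\beta_n - \beta_n^*) = -Dl_n(\beta_n^*), \tag{15}$$

where

$$C_n = \int_0^1 D^2 l_n\big(\beta_n^* + t(\hat\beta_n - \beta_n^*)\big)\,dt.$$

Here $Dl_n(\beta)$ is the $K_n$ dimensional vector of elements $\partial l_n(\beta)/\partial\beta_k$ and $D^2 l_n(\beta)$ is
the $K_n \times K_n$ dimensional matrix of elements $\partial^2 l_n(\beta)/\partial\beta_k\,\partial\beta_m$.

Let $\cdot$ and $|\ |$ denote the usual inner product and corresponding norm on $\mathbb{R}^{K_n}$. It
follows from (15) that

$$(\hat\beta_n - \beta_n^*) \cdot C_n(\hat\beta_n - \beta_n^*) = -(\hat\beta_n - \beta_n^*) \cdot Dl_n(\beta_n^*). \tag{16}$$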
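Formulas (13) and (14) are the gradient and Hessian of the log-likelihood in the coefficient vector, so the maximum likelihood equations can be solved by Newton's method. The sketch below is a toy illustration rather than the procedure used in the paper: it assumes a single predictor ($J = 1$), the Poisson family, made-up data, and a piecewise-constant basis of two indicator functions, for which the Hessian is diagonal and the solution is known in closed form (the logarithm of the mean response in each interval).

```python
import math

def gradient_and_hessian(beta, basis, xs, ys, db1, db2, d2b1, d2b2):
    """Gradient (13) and Hessian (14) of l_n at beta for a_beta = sum_k beta_k A_k."""
    K = len(beta)
    grad = [0.0] * K
    hess = [[0.0] * K for _ in range(K)]
    for x, y in zip(xs, ys):
        A = [A_k(x) for A_k in basis]
        eta = sum(b * a for b, a in zip(beta, A))
        w1 = db1(eta) * y + db2(eta)      # b1'(a(X_i)) Y_i + b2'(a(X_i))
        w2 = d2b1(eta) * y + d2b2(eta)    # b1''(a(X_i)) Y_i + b2''(a(X_i))
        for k in range(K):
            grad[k] += A[k] * w1
            for m in range(K):
                hess[k][m] += A[k] * A[m] * w2
    return grad, hess

def indicator(lo, hi, closed_right=False):
    """Indicator of [lo, hi), or of [lo, hi] when closed_right is True."""
    return lambda x: 1.0 if (lo <= x < hi or (closed_right and x == hi)) else 0.0

# Hypothetical setup: two intervals on [0, 1], Poisson responses.
basis = [indicator(0.0, 0.5), indicator(0.5, 1.0, closed_right=True)]
xs = [0.1, 0.2, 0.4, 0.6, 0.9]
ys = [1, 2, 3, 4, 6]
poisson = dict(db1=lambda e: 1.0, db2=lambda e: -math.exp(e),
               d2b1=lambda e: 0.0, d2b2=lambda e: -math.exp(e))

def newton_fit(beta, steps=50):
    """Newton iterations; valid here because the Hessian is diagonal."""
    for _ in range(steps):
        grad, hess = gradient_and_hessian(beta, basis, xs, ys, **poisson)
        beta = [b - g / hess[k][k] for k, (b, g) in enumerate(zip(beta, grad))]
    return beta

beta_hat = newton_fit([0.0, 0.0])
_, hess_hat = gradient_and_hessian(beta_hat, basis, xs, ys, **poisson)
```

For overlapping B-splines the Hessian is banded rather than diagonal, and each Newton step requires solving a linear system; with $J > 1$ the centering constraints on the components must also be imposed because the $A_k$'s are then linearly dependent.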

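It will be shown shortly that

$$|Dl_n(\beta_n^*)|^2 = O_{pr}(n) \tag{17}$$

and that $\hat\beta_n$ and $\beta_n^*$ can be chosen so that (for some positive constant $M_{26}$)

$$(\hat\beta_n - \beta_n^*) \cdot C_n(\hat\beta_n - \beta_n^*) \le -M_{26}N_n^{-1}n\,|\hat\beta_n - \beta_n^*|^2 \tag{18}$$

except on an event whose probability tends to zero with $n$. It follows from
(16)-(18) that

$$|\hat\beta_n - \beta_n^*|^2 = O_{pr}(N_n^2/n)$$

and hence from (12) that

$$\|\hat f_n - f_n^*\|^2 = O_{pr}(N_n/n) = O_{pr}(n^{-2r}). \tag{19}$$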




It now follows from Lemma 8 that

$$\|\hat f_n - f^*\|^2 = O_{pr}(n^{-2r}). \tag{20}$$
Let f; be written in the form 
J 
PETOT Xy) = fro T LE); 
i 


where //7,g, = 0 for 1 <j < J. It follows from Lemma 8 together with Lemma 1 
of Stone (1985) that 


(21) Wit, ~ FIG = O(n) forl <7 <d, 
(22) ( n0 zi is): On), 
and 
1 n 
(23) => a (X,,) = O(n") = opn) forl <7 <d. 
i 


Let i. temporarily be written similarly as 


J 
(24) Ti a) = a i Ehh 


where f f, g; = 0for1 <j < J. It follows from (19) and Lemma 1 of Stone (1985) 
that 


(25) fey - fali = O(n”) for 1 sJj<d 
and 
(26) (Fae ~~ n0 i = O(n). 


Choose e > 0. It follows from Lemma 12 of Stone (1985) that 


n 


-E( PESE A o(a) 


l 
= Opl n?) 


and hence from (23) that 
1 n 
(27) = È f(X) = O(n) forl<j<d. 
Let f, be rewritten in the form (24) with 
1 n 
~E f (X,)=0 forl<j<d. 
wi 


It follows from (27) that (25) and (26) continue to hold. It follows from (21), (22), 
(25), and (26) that 


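$$\|\hat f_{nj} - f_j^*\|_j^2 = O_{pr}(n^{-2r}) \quad \text{for } 1 \le j \le J \tag{28}$$

and

$$(\hat f_{n0} - f_0^*)^2 = O_{pr}(n^{-2r}). \tag{29}$$

It follows from (28) and Lemma 8 of Stone (1985) that

$$\big\|\hat f_{nj}^{(m)} - (f_j^*)^{(m)}\big\|_j^2 = O_{pr}(n^{-2r_m}) \quad \text{for } 0 \le m \le q \text{ and } 1 \le j \le J. \tag{30}$$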








Formulas (20), (29), and (30) together constitute the conclusion of Theorem 2. 
It remains to verify (17) and (18). To verify (17) note that 


EA,(X)[bi( )DY + bal f2(X))] = 0. 


Consequently, 
2 


-X LEAI j(X))¥, + BAX)? 


= nL E{Ax X)[b4( fC X)¥ + BCX’) 


K,, 
< Myn} E{AX(X)} 


by Conditions 3 and 4, Theorem 1, and Lemma 8. It follows from the prop- 
erties of B-splines that FA{(X) = EB?(X,) < M.3N,' and hence that 
E|DL,(B*)|? < Man. Therefore (17) holds. 

Finally, (18) will be verified. According to Conditions 2 and 3 there is a 
compact subinterval S, of S such that E(Y|X =x) <8, for x € C. Choose 
e > 0. It now follows from Conditions 2 and 4 that there are subintervals S, and 
S, of S such that S, is closed and bounded on the left, S, is closed and bounded 
on the right, and P(Y € S,|X = x) 2 e and Pr(Y € &|X = x) > e for x EC. 
Given no > 0 set 


S; = {y E S: bYln)y + by(m) < —e for |n| < No} - 
Then e can be chosen sufficiently small so that 


(31) Pr(Y € &|X =x)ze forx ec. 
By Theorem 1, Lemmas 7 and 8, and (20), 7, can be chosen so that 
(32) lim Pr(| fll < Mo and Il fallo < T0) = 1. 


Set $, = {i: 1 <i < n and Y, € S,}. It follows from (14) and (32) that, except on 
an event whose probability tends to zero with n, 


(33) p: C,,B = —e)a%( X,). 
4, 




Let B = (B,) ~ (6,,) so that a(x) = Liag (x), where ag, (x,) = Dirb Ba AX,)- 
Let £ now be chosen so that 


(34) 2 ag (X,,)=0 for2sj< J. 
Sn 


It follows from (12), (31), (33), (34), Lemma 12 of Stone (1985), and an extension 
of Lemma 3 of the same paper that, except on an event whose probability tends 
to zero with n, 


J 
da3( X,) > My2, 2 ah (X) 
S, lS, 


J 
> Mæn ilag l 
I 


> MynN>|B\?. 


Therefore (18) holds if $\hat\beta_n$ and $\beta_n^*$ are chosen so that $\beta = \hat\beta_n - \beta_n^*$ satisfies (34).
This completes the proof of (18) and hence that of Theorem 2.


REFERENCES 


BAKER, R. J. and NELDER, J. A. (1978). The GLIM System, Release 3. Generalized Linear Interactive
Modelling. Numerical Analysis Group, Oxford.

DE BOOR, C. (1968). On uniform approximation by splines. J. Approx. Theory 1 219-235.

DE BOOR, C. (1978). A Practical Guide to Splines. Springer, New York.

BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. and STONE, C. J. (1984). Classification and Regression
Trees. Wadsworth, Belmont, Calif.

BURMAN, P. (1985). Estimation of generalized additive models. Technical Report, Dept. Statistics,
Rutgers Univ.

DOBSON, A. J. (1983). An Introduction to Statistical Modelling. Chapman and Hall, London.

HABERMAN, S. J. (1977). Maximum likelihood estimates in exponential response models. Ann.
Statist. 5 815-841.

HASTIE, T. J. (1984). Comment (on pages 77-78) to graphical methods for assessing logistic regression
models, by J. M. Landwehr, D. Pregibon, and A. C. Shoemaker. J. Amer. Statist. Assoc.
79 61-83.

HASTIE, T. J. and TIBSHIRANI, R. J. (1984). Generalized additive models. Technical Report, Div.
Biostatistics, Stanford Univ.

MCCULLAGH, P. and NELDER, J. A. (1983). Generalized Linear Models. Chapman and Hall, London.

NELDER, J. A. and WEDDERBURN, R. W. M. (1972). Generalized linear models. J. Roy. Statist. Soc.
Ser. A 135 370-384.

STONE, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist.
10 1040-1053.

STONE, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689-705.

STONE, C. J. and KOO, C.-Y. (1986). Additive splines in statistics. Amer. Statist. Assoc. 1985 Proc.
Statist. Comput. Sec. To appear.

WEDDERBURN, R. W. M. (1976). On the existence and uniqueness of the maximum likelihood estimates
for generalized linear models. Biometrika 63 27-32.


DEPARTMENT OF STATISTICS 
UNIVERSITY OF CALIFORNIA 
BERKELEY, CALIFORNIA 94720 


The Annals of Statistics 
1986, Vol 14, No 2, 607-618 


TIME SEQUENTIAL ESTIMATION OF THE EXPONENTIAL 
MEAN UNDER RANDOM WITHDRAWALS! 


By JOSEPH C. GARDINER, V. SUSARLA AND JOHN VAN RyZIN 


Michigan State University, State University of New York-Binghamton 
and Columbia University 


In the context of lifetesting, an asymptotically risk-efficient procedure for 
the estimation of the exponential mean lifetime is considered when the 
survival times of the units are subject to random censorship. The loss function 
is the sum of squared error due to estimation, cost of recruitment of the units, 
and cost of total time on test. Asymptotic properties of the sequential 
estimator and stopping time are described as the per unit cost of total time on 
test decreases to zero. 


1. Introduction. In several statistical experiments pertaining to reliability, 
life tests, and other longitudinal investigations a sample of units on test are under 
continual surveillance until one or the other specified terminal response is 
recorded for each unit. Such experiments may entail a considerable expenditure 
in costs and time particularly if the per unit cost of recruitment of subjects into 
the study and of follow-up time are high. It is then desirable to curtail observa- 
tion at an intermediate state, prior to the last response being recorded, and base 
analyses on the current accumulated statistical evidence should it seem war- 
ranted for the study under consideration. 

In this article we address the problem of estimation of the mean exponential
lifetime $\theta$ from a sample of subjects whose survival times are deterred from
complete observation due to random withdrawals or censorship. For each unit,
the censoring variable $Y$ is assumed independent of the survival time $X$, but is
otherwise unknown, and the investigator only observes the datum $(Z, \delta)$ where
$Z = \min(X, Y)$ and $\delta = 1$ or $0$ according as $Z = X$ or $Y$. Let $(Z_{(1)}, \ldots, Z_{(n)})$ denote
the order statistic corresponding to the random sample $Z_1, \ldots, Z_n$ and $\delta_{(i)} = 1$ if
$Z_{(i)}$ is uncensored and $= 0$ otherwise. At the $k$th response, $1 \le k \le n$, we have
at our disposal the data $\{(Z_{(i)}, \delta_{(i)}): 1 \le i \le k\}$ on the basis of which an
appropriate estimator $\hat\theta_{n,k}$ of $\theta$ can be constructed. The loss incurred up to this
stage is measured by

$$L_{n,k} = a(\hat\theta_{n,k} - \theta)^2 + bn + cV_{n,k}, \tag{1.1}$$

where the total time on test (TTT) or follow-up time expended is

$$V_{n,k} = \sum_{i=1}^{k} Z_{(i)} + (n - k)Z_{(k)}. \tag{1.2}$$


The weights a, b, and c are given positive constants; b may be interpreted as the 


Received June 1984; revised February 1985. 

'Research supported in part by the National Institute of General Medical Sciences under Grant 
1R01-GM28406 and in part by the Office of Naval Research under Contract N00014-79-C0522. 

AMS 1980 subject classifications. Primary 62112, 62115; secondary 62G05. 

Key words and phrases. Asymptotic normality, martingale, order statistics. 




608 J.C. GARDINER, V. SUSARLA AND J. VAN RYZIN 


per unit cost of recruitment of units into the study and $c$ the per unit cost of
follow-up time. The risk in estimating $\theta$ by $\hat\theta_{n,k}$ is thus

$$R_{n,k} = aE(\hat\theta_{n,k} - \theta)^2 + bn + cEV_{n,k}, \tag{1.3}$$

and quite naturally, we seek to minimize the risk with appropriate choice of $k$,
for a given sample size $n$ and specified constants $a$, $b$, and $c$. With the underlying
censoring distribution left unspecified and the variables $\{(Z_{(i)}, \delta_{(i)}): 1 \le i \le k\}$
being neither independent nor identically distributed, we do not have a tractable
expression for the risk (1.3). However, when $n$ is large, (1.3) can be reduced to a
more transparent form from which the optimal choice of $k$ can be readily
obtained. Since $\theta$ and the censoring distribution are both unknown it turns out
that no unique choice of $k$ minimizes the risk universally and hence we explore
an alternative sequential procedure.
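The total time on test (1.2) and the loss (1.1) are straightforward to compute from the observed times. The following sketch uses hypothetical function names and made-up data; the estimate passed to `loss` is an arbitrary placeholder, since the construction of $\hat\theta_{n,k}$ under censoring is the subject of the later sections.

```python
def total_time_on_test(z, k):
    """V_{n,k} = sum of the k smallest observed times plus (n - k) times the
    k-th smallest, i.e. the follow-up time spent when stopping at the k-th response."""
    zs = sorted(z)
    n = len(zs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return sum(zs[:k]) + (n - k) * zs[k - 1]

def loss(theta_hat, theta, z, k, a, b, c):
    """L_{n,k} = a * (theta_hat - theta)^2 + b * n + c * V_{n,k}."""
    n = len(z)
    return a * (theta_hat - theta) ** 2 + b * n + c * total_time_on_test(z, k)

# Made-up observed times Z_1, ..., Z_4.
z = [3.0, 1.0, 2.0, 5.0]
v_at_2 = total_time_on_test(z, 2)   # 1 + 2 + (4 - 2) * 2
v_at_n = total_time_on_test(z, 4)   # all units followed to their response
```

Stopping at $k = n$ gives the sum of all observed times, while smaller $k$ trades follow-up cost against estimation accuracy.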

In this paper we describe a time-sequential scheme for the estimation of @ 
which under some natural assumptions are asymptotically risk efficient as the per 
unit cost c — 0. Furthermore, we derive pertinent asymptotic properties of the 
sequential estimator and stopping number of the proposed scheme. Section 2 
states the main results of the paper together with the notation and assumptions 
to be used in the sequel. The proofs are given in Section 3 followed by several 
auxiliary lemmata whose detailed proofs are available in Gardiner, Susarla, and 
Van Ryzin (1984), which also contains other relevant references. 

A time-sequential procedure analogous to that described in this article has 
been considered by Sen (1980) when censorship is absent. In this case the risk 
(1.3) takes on a particularly simple form for each n, k and further (1.2) is
expressible as the sum of independent exponential (mean $\theta$) variates. In fact we
may take $\hat\theta_{n,k} = k^{-1}V_{n,k}$ and thus (1.3) reduces to $k^{-1}a\theta^2 + bn + kc\theta$. Sen's
treatment is remarkably elegant and exploits fully these special circumstances. In
the present article, our time-sequential procedure is based on the data
$\{\{(Z_{(i)}, \delta_{[i]}): 1 \le i \le k\}, 1 \le k \le n\}$ which are neither independent nor identically
distributed. Furthermore the presence of random censorship and the nature
of (1.1) lead to several technical complications which require a more subtle
analysis.
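In the uncensored case just described, the trade-off behind the choice of k can be made concrete. The sketch below is illustrative only: it takes the risk to be $a\theta^2/k + bn + c\theta k$, using $E(\hat\theta_{n,k} - \theta)^2 = \theta^2/k$ and $EV_{n,k} = k\theta$ for a sum of k independent exponential (mean $\theta$) variates, and the function names and numerical values are assumptions chosen for the example.

```python
import math


def uncensored_risk(k, n, theta, a, b, c):
    """Risk a*theta^2/k + b*n + c*theta*k in the uncensored exponential case."""
    return a * theta ** 2 / k + b * n + c * theta * k


def best_k(n, theta, a, b, c):
    """Integer k in {1, ..., n} that minimizes the uncensored risk."""
    return min(range(1, n + 1),
               key=lambda k: uncensored_risk(k, n, theta, a, b, c))


# Hypothetical inputs: the continuous minimizer is sqrt(a * theta / c) = 20.
k_opt = best_k(n=100, theta=2.0, a=1.0, b=0.01, c=0.005)
k_cont = math.sqrt(1.0 * 2.0 / 0.005)
```

The minimizer depends on the unknown $\theta$, which is one way to see why a fixed choice of k cannot be optimal for all parameter values.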


2. Main results. Let X be an exponential r.v. with mean $\theta$, $\theta \in (0, \infty)$, and
let Y have the survival distribution $G(\cdot) = P(Y > \cdot)$. Both $\theta$ and G are unknown.
Throughout the paper, we assume that G has a continuous density g on
its support $[0, y_0)$, $y_0 \le \infty$. Let $H = GF$ with $F(\cdot) = P(X > \cdot)$. In the random
censorship model we observe $(Z, \delta)$, where $Z = X \wedge Y$ and $\delta = [X \le Y]$. The
symbol [A] denotes the indicator of the event A. For a random sample of
n ($\ge 1$) units on test we have an underlying sequence of iid r.v.'s $\{(Z_i, \delta_i):
1 \le i \le n\}$. However, in view of the nature of our problem, at any
stage k, we only have the data $\{(Z_{(i)}, \delta_{[i]}): 1 \le i \le k\}$ as described in the
introduction. For each k, $1 \le k \le n$, we construct an estimator $\hat\theta_{n,k}$ of $\theta$ adapted
to $\mathcal{B}_{n,k} = \sigma\{(Z_{(i)}, \delta_{[i]}): 1 \le i \le k\}$ by maximizing its likelihood over $\theta$. This
leads to a unique maximizer $V_{n,k}/\delta_{n,k}$, provided $\delta_{n,k} = \sum_{i=1}^{k} \delta_{[i]}$ is nonzero. So


TIME SEQUENTIAL ESTIMATION 609 


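for $1 \le k \le n$, define the estimator $\hat\theta_{n,k}$ of $\theta$ by


(2.1) $\hat\theta_{n,k} = \begin{cases} V_{n,k}/\delta_{n,k} & \text{if } \delta_{n,k} > 0, \\ 0 & \text{otherwise.} \end{cases}$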
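The total time on test (1.2) and the estimator (2.1) are straightforward to compute from the ordered responses. The following Python sketch is illustrative only: the function names `progressive_stats` and `simulate`, and the exponential censoring distribution used in the simulation, are assumptions made for the example and are not part of the paper.

```python
import random


def progressive_stats(z, delta, k):
    """Return (V_nk, delta_nk, theta_hat_nk) from the first k ordered responses.

    z[i] is the observed time min(X_i, Y_i); delta[i] is 1 if uncensored, else 0.
    """
    n = len(z)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    pairs = sorted(zip(z, delta))            # order by observed time
    z_ord = [t for t, _ in pairs]
    d_ord = [d for _, d in pairs]
    # Total time on test: first k ordered times plus (n - k) units still at risk.
    v_nk = sum(z_ord[:k]) + (n - k) * z_ord[k - 1]
    d_nk = sum(d_ord[:k])                    # number of uncensored responses so far
    theta_hat = v_nk / d_nk if d_nk > 0 else 0.0
    return v_nk, d_nk, theta_hat


def simulate(n, theta, censor_mean, seed=0):
    """Simulate (Z_i, delta_i) with exponential lifetimes and, as an
    assumption for illustration, exponential censoring times."""
    rng = random.Random(seed)
    z, delta = [], []
    for _ in range(n):
        x = rng.expovariate(1.0 / theta)
        y = rng.expovariate(1.0 / censor_mean)
        z.append(min(x, y))
        delta.append(1 if x <= y else 0)
    return z, delta
```

For example, `progressive_stats(*simulate(200, 2.0, 3.0), k=120)` returns the total time on test, the number of uncensored responses, and the estimate after the 120th response.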


The risk incurred in estimation is given by (1.3), which we seek to minimize by
an optimal choice $k_n^0$ of $k \in \{1, \ldots, n\}$. If $k_n^0$ ($\le n$) is an optimal selection, the
experiment is stopped at the $k_n^0$th stage and $\theta$ is estimated by $\hat\theta_{n,k_n^0}$ of (2.1) with
corresponding minimum risk $R_n^0 = R_{n,k_n^0}$. In the generality considered here, if
the weights a, b, c and sample size n are fixed, no explicit mathematical form in
k for (1.3) is available and therefore we seek a solution to the problem of determining
$k_n^0$ when n is large. Formally, for a given per unit TTT cost c (> 0) we take
n = n(c) observations such that


(2.2) $\lim_{c \downarrow 0} c\,n^2(c) = a^*$ where $a^* \in (0, \infty)$.


For a justification of (2.2), see Sen (1980). We also assume that

(2.3) $b = \rho c$ where $\rho \in (0, \infty)$.


In the sequel all limits are taken as $c \downarrow 0$. We shall prove that, in this situation, an
optimal choice of $k_n^0$ can be obtained by examining the behavior of (1.3) along
sequences $\{k_n\}$ for which $n^{-1}k_n \to \lambda \in (0,1]$. Then we show that (1.3) has the
expansion


(2.4) $R_{n,k_n} = (a\theta^3/b_\lambda)k_n^{-1} + \rho c n + c b_\lambda k_n + o(k_n^{-1}),$

where

(2.5) $b_\lambda = \lambda^{-1} \int_0^{H^{-1}(\lambda)} H(x)\,dx, \qquad H'(0)\,b_0 = -1,$

and

(2.6) $H^{-1}(t) = \inf\{s \ge 0: 1 - H(s) \ge t\}.$

From (2.4) an optimal choice is an integer k* where

(2.7) $\operatorname{int}(a\theta^3/c b_\lambda^2)^{1/2} \le k^* \le \operatorname{int}(a\theta^3/c b_\lambda^2)^{1/2} + 1.$


Here int(x) denotes the largest integer $\le x$. In view of (2.2), we also have
$n^{-1}k_n^0 \to \lambda$ provided $\lambda$ is a solution of


(2.8) $\int_0^{H^{-1}(\lambda)} H(x)\,dx = (a\theta^3/a^*)^{1/2}.$


The left-hand side of (2.8) is a strictly increasing continuous function in $\lambda$ whose
value at $\lambda = 1$ is EZ. We can therefore obtain a unique solution in $\lambda$ for (2.8)
with $\lambda \in (0,1]$ provided


(2.9) $\infty > a^* \ge a\theta^3/(EZ)^2.$
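Equation (2.8) asks for the $\lambda$ at which the integral of H up to the quantile $H^{-1}(\lambda)$ matches a given right-hand side. Because that integral is increasing in its upper limit t, one can solve for t by bisection and then read off $\lambda = 1 - H(t)$. The sketch below is a hedged illustration: `solve_lambda`, the trapezoidal integration, the truncation point `t_max`, and the exponential censoring model used in the example are all assumptions introduced here, and the right-hand side of (2.8) is simply passed in as `target`.

```python
import math


def solve_lambda(target, surv, t_max, steps=2000):
    """Solve integral_0^t surv(x) dx = target for t, and return 1 - surv(t).

    surv is the survival function H of Z (continuous, strictly decreasing).
    Returns 1.0 when the integral over [0, t_max] does not exceed target,
    which corresponds to the case where no solution below 1 exists.
    """
    def integral(t):
        # Trapezoidal rule on [0, t].
        h = t / steps
        total = 0.5 * (surv(0.0) + surv(t))
        for i in range(1, steps):
            total += surv(i * h)
        return total * h

    if integral(t_max) <= target:
        return 1.0
    lo, hi = 0.0, t_max
    for _ in range(60):                      # bisection on the upper limit t
        mid = 0.5 * (lo + hi)
        if integral(mid) < target:
            lo = mid
        else:
            hi = mid
    return 1.0 - surv(hi)


# Hypothetical example: X exponential with mean 2, Y exponential with mean 3,
# so H(x) = exp(-rate * x) with rate = 1/2 + 1/3 and EZ = 1/rate = 1.2.
rate = 1.0 / 2.0 + 1.0 / 3.0
lam = solve_lambda(0.5, lambda x: math.exp(-rate * x), t_max=60.0)
```

In this exponential example the integral equals $\lambda/\text{rate}$, so the solution is available in closed form ($\lambda = 0.5 \times \text{rate}$) and serves as a check on the numerical routine.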


However, $\lambda$ could be small if $a^*$ is large. We are therefore led to define the
optimal choice $k_c^0$ by


(2.10) ke = fag if a* > a6*( EZ) *, 

: n otherwise, 
where $\lambda$ satisfies (2.8). Note that if $a^* > a\theta^3(EZ)^{-2}$ we have $\lambda < 1$ and so,
asymptotically as $c \downarrow 0$, $k_c^0 < n$. Unfortunately both $k_c^0$ and $R_c^0$ still remain
unspecified since $\theta$ and G are unknown. This motivates consideration of a
sequential procedure defined via a random stopping number $N_c$ and the associated
estimator $\hat\theta_{N_c}$ ($= \hat\theta_{n,N_c}$). This procedure is shown to be asymptotically risk
efficient; that is, its risk $R^* = E[L_{n,N_c}]$ is such that $R^*/R_c^0 \to 1$ as $c \to 0$.
Observe that $V_{n,k}$ of (1.2) can be expressed as

(2.11) $V_{n,k} = n \int_0^{Z_{(k)}} H_n(x)\,dx,$

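where $nH_n(t) = \sum_{i=1}^{n} [Z_i > t]$. Then in view of (2.1) and (2.8), we define the
stopping number $N_c$ by


(2.12) $N_c = \begin{cases} \min\{k \le n - 1: \delta_{n,k}^3 \ge (a/c)V_{n,k}\}, \\ n \quad \text{if no such } k \text{ exists.} \end{cases}$

We can now state the main results of the paper. All limits are taken as $c \downarrow 0$ in the
rest of the paper.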
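A stopping rule of this kind can be evaluated by a single pass over the ordered responses. The sketch below assumes the rule reads "stop at the first $k \le n - 1$ with $\delta_{n,k}^3 \ge (a/c)V_{n,k}$, otherwise at n", which is the form of the event used in the proof of Lemma 3.2; the function name `stopping_number` and the toy data are hypothetical.

```python
def stopping_number(z, delta, a, c):
    """Return (N_c, theta_hat at N_c) for the assumed stopping rule.

    z[i] is the observed time, delta[i] is 1 if uncensored and 0 otherwise.
    """
    n = len(z)
    if n == 0:
        raise ValueError("need at least one observation")
    pairs = sorted(zip(z, delta))            # ordered responses
    running_sum = 0.0
    d_nk = 0
    v_nk = 0.0
    k = 0
    for k, (t, d) in enumerate(pairs, start=1):
        running_sum += t
        d_nk += d
        v_nk = running_sum + (n - k) * t     # total time on test V_{n,k}
        if k < n and d_nk ** 3 >= (a / c) * v_nk:
            break                            # first crossing: stop here
    # If no crossing occurred, the loop ended with k == n.
    theta_hat = v_nk / d_nk if d_nk > 0 else 0.0
    return k, theta_hat
```

Rearranged, the stopping condition is a sample version of $k \ge (a\hat\theta^3/c\hat b^2)^{1/2}$ with $\hat\theta = V_{n,k}/\delta_{n,k}$ and $\hat b = V_{n,k}/k$, so the experiment stops once the current stage reaches the estimated optimal stage.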


THEOREM 2.1. Under (2.2), (i) $N_c/k_c^0 \to 1$ a.s. and (ii) $E(N_c/k_c^0)^r \to 1$ for
any $r > 0$.


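THEOREM 2.2. Under (2.4), (i) $\hat\theta_{N_c} \to \theta$ a.s. and (ii) $\mathcal{L}\{N_c^{1/2}(\hat\theta_{N_c} - \theta)/\hat\sigma\} \to
N(0,1)$, where $V_{n,N_c}\hat\sigma^2 = N_c\hat\theta_{N_c}^3$.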


THEOREM 2.3. Under (2.4) and (2.5), $R^*/R_c^0 \to 1$ as $c \to 0$.


3. Representation of risk $R_{n,k}$. Before we commence the proofs of
the theorems of Section 2, we shall provide here arguments leading to
the risk expansion (2.4) under the restriction to the sequences $\{k_n\}$ such that
$k_n/n \to \lambda \in (0,1]$. We then show that for the minimization (of risk $R_{n,k}$)
problem described in Section 2, the optimal sequence cannot be such that
$k_n/n \to 0$. The argument for any other type of sequence $\{k_n\}$ is given right
before (3.16). Throughout, $k_n \in \{1, \ldots, n\}$ and the dependence of $k_n$ on n is
suppressed for convenience. All the inequalities involving conditional expectations
are taken to hold almost surely. Let $\|\xi\|_p = (E[|\xi|^p])^{1/p}$ for any $p \ge 1$.

In view of (2.1), we write for each k,


(3.1) $\hat\theta_{n,k} - \theta = (V_{n,k} - \theta\delta_{n,k})\,\delta_{n,k}^{-1}[\delta_{n,k} \ge 1] - \theta[\delta_{n,k} = 0].$


Setting U, , = Vy.,—95,,, and kd, , = cpt Waite (n+ 1)p, p= k, 
(3.1) takes the form 


(3.2) Â, a - 9 = O(1/d, ,){(U,, ./R) J W, ahs 




where 


W, 


n, 


k 7 (k'U,, p) {kT 8n g > Odak} 


x {k ee ‘LS, 21] — 6[6,, = 0]. 


Our goal below is then to obtain the behavior of the second moment of $(\hat\theta_{n,k} - \theta)$
via representation (3.2) and the moments of $U_{n,k}$, $\delta_{n,k}$ and $\delta_{n,k}^{-1}$.

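By Lemma 4.1, we have that $\{U_{n,k} = \sum_{i=1}^{k} U^*_{n,i}, \mathcal{B}_{n,k}: 1 \le k \le n\}$ is a zero
mean martingale where, for each k,


(3.4) $U^*_{n,k} = (n - k + 1)(Z_{(k)} - Z_{(k-1)}) - \theta\delta_{[k]} = V^*_{n,k} - \theta\delta_{[k]}.$


Therefore,

(3.5) $E|U_{n,k}|^2 = E\left\{\sum_{i=1}^{k} E[U^{*2}_{n,i} \mid \mathcal{B}_{n,i-1}]\right\}.$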


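Using the conditional distribution of $Z_{(i)}$ given $\mathcal{B}_{n,i-1}$ and (3.4), we obtain that
$E(U^{*2}_{n,i} \mid \mathcal{B}_{n,i-1}) = \theta E(V^*_{n,i} \mid \mathcal{B}_{n,i-1})$ which, via (3.5) and Lemma 4.4, implies that


(3.6) $k^{-1}E(U_{n,k}^2) \to \theta\lambda^{-1} \int_0^{H^{-1}(\lambda)} H(x)\,dx = \theta b_\lambda,$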


(3.3) 


under the assumption $k/n \to \lambda \in [0,1]$, which is assumed until otherwise stated.
Therefore, in view of (2.5), we have from (3.2) and (3.6), 


à lg)? 3 T -1 
(3.7) E(G, p 0) (að /b,)k + o(k7') 


+ 07d 774{ E( Wr, a{ Wa, ~ 227 'U,,,})}. 
Since $V_{n,k} = \sum_{i=1}^{k} V^*_{n,i}$, Lemma 4.4 implies

(3.8) $k^{-1}E(V_{n,k}) \to b_\lambda.$


Thus the expansion (2.4) will hold in view of (2.2), (3.7), and (3.8) provided we
show


(3.9) $E(W_{n,k}\{W_{n,k} - 2k^{-1}U_{n,k}\}) = o(k^{-1}).$
To obtain this, note that from (3.3) and Hélder’s inequality, 


(3.10) Waa all’ S |Ik Un, llellR7'8,. » — Od, nlle 


XİT 8n Lb. = Ljig + O°P[ ôn, p ii 0]. 
Therefore, applying Lemmata 4.2, 4.4, and 4.6 we obtain

(3.11) $0 \le E(W_{n,k}^2) - \theta^2 P[\delta_{n,k} = 0] = O(k^{-2}).$


When $n^{-1}k \to \lambda \in (0,1]$, we show $P[\delta_{n,k} = 0] = o(n^{-\alpha})$ for any $\alpha > 0$, whence
in this case $E(W_{n,k}^2) = O(k^{-2})$. Also


(3.12) $E(W_{n,k}(k^{-1}U_{n,k})) \le \|W_{n,k}\|_2 \|k^{-1}U_{n,k}\|_2 = O(k^{-3/2}).$

Combining (3.10), (3.11), and (3.12) will yield the desired (3.9). Thus we are left
with proving

(3.13) $P[\delta_{n,k} = 0] = o(n^{-\alpha})$


provided n~'k — A € (0,1]. Consider first the case A = (0,1). Since [6,,= 0] = 
N [Su = 0], we obtain that 


k 
(14) PLn, a= 0] = (nt/(n = BY} f TIRE) 


where the integration is over {(4,,---, ¥,):0<y, < -+: < Yp < oo}. On writing 
Yap- -+s Yen) for the order statistics corresponding to a random sample of size n 
from G, (3.14) may be written as 


(3.15) P[6, = 0] = zl| LI AO 


For arbitrary $\varepsilon > 0$ consider the expectation in (3.15) separately over the events
$[Y_{(k)} \le \varepsilon]$ and $[Y_{(k)} > \varepsilon]$. Then we obtain exponential bounds for the probability
of these two events; for the first one by a judicious choice of $\varepsilon$ ($k/n - G(\varepsilon) > 0$
for large n) and Hoeffding's (1963) inequality, and the second one is bounded by
$(F(\varepsilon))^{n-k+1}$. These two bounds lead to (3.13), and (3.13) holds also for $\lambda = 1$ since
$\delta_{n,k}$ is nondecreasing in k.

Observe that when n`'k > 0, Rn , 2 (a0°/b))k~' + 0(k7'). However, 
R „n = (a0"/EZ + a*p + a*EZ jn + o(n~') and therefore as n — oo, we have 

nk > Rn, ne Thus the optimal sequence (k) cannot satisfy n~'k — 0. Also nole 

that whenever n`'k >À € [0,1], AR, , 2 inf{ad’/b, + a*A(p + Ab,)} + 
o(1) = C + o(1), with C = ab?/sup b,. Now the usual arguments utilizing subse- 
quences reveal that for all sequences (k,) we have liminfkR, , > C + o(1). It 
now follows that we may restrict attention to sequences for which n-'k > À € 
(0, 1], in order to obtain the optimal choice k°. 

We now turn to the proofs of the theorems of Section 2. Recall the definition
of the stopping number $N_c$ of (2.12). We also define $\tau_c$ by


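(3.16) $\tau_c = \begin{cases} \min\{k \le n - 1: \delta_{n,k}^3 \ge (a/c)n^{-\gamma}\}, \\ n \quad \text{if no such } k \text{ exists.} \end{cases}$


Here $\gamma > 0$ is a constant to be selected later. We prove


LEMMA 3.1. As $c \downarrow 0$, almost surely, $\tau_c < n$ and $\tau_c \ge k_{1c}$, where $k_{1c} =
\operatorname{int}[(a/c)^{1/3} n^{-\gamma/3}]$.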


Proor. Note that n7'k,. < (a/cen?) Pn 0+? and so from (2.4) as c 0 we 
have k,, < n. Also, the inequality 7, 2 kı, is immediate once we established 
7 <n as. as c40. To this end observe that from (3.15), [7, =n] € [n7'8, n < 
(a/c) nY + n~']. Hence since E(n7'8, ,) = 07 'EZ and n(c) ~ c, we 
obtain that for any sequence (c,,)10 and n large enough Pir, =n]< 
Pl{n7'8, „— 07 'EZ} < —(26)~'EZ]. The result now follows from Hoeffding’s 
(1963) inequality and the Borel-Cantelli lemma. 




The salient properties of the stopping number NC are summarized in the 
following lemma. For 0 < e < 1, write ks, = int(k?(1 — e)) and ky, = int(k}(1 + 
e)), where k? was defined in (2.10). Notice that with c sufficiently small, 
ki. < ky, < kze In the sequel all limits are as c }0, unless otherwise stated. 


LEMMA 3.2. As cJ0, almost surely (i) N, 2 t, and (ii) N, < n if a*( EZ)? > 
aĝ’. 

Furthermore, for p > 2, (iii) PIN, > R3,] = O(e?”*), and (iv) PLN, < ka] = 
oO cP AINAR, 


Proor. (i) Observe that in view of Lemma 3.1, 
N, 
(3.17) [N. < m] = [N, < ku] ULM > Riel] N [624 < (a/c)n™]]. 
h=hy, 
Now [N, < kie] = [62,2 (a/ce)V, pœ for some k€ {1,...,k, — 1}] € [Za s 
n70+]. Similarly the second event on the r.h.s. of (3. 17) is contained 
in [Zp s n0t]. Since the series L{P[X < n 0+0] + PLY < n+] 
is convergent, it now follows from the above set of inequalities and Theorem 
4.3.3 of Galambos (1978) that for any sequence (c,,) with Cm 40 as m > œ, 
Ln PLN, < T, _] < œ. This gives (i) of the lemma. 

(ii) By definition [N, =n] = [62 „nı < (a/e)V, ,-,]. In view of Lemma 4.4, 
(n — 1)`'V, „-ı > EZ as. and = b n-1 29 EZ a.s. Hence from 
(2.4) and our pohe [N, = n] > [a* < ab?/EZ?] = 0 a.s. which entails 
N. <n as. 

To establish (iii) note that PIN, > k3.]= PIV; z, > (c/a)ðf p (O r, T 
(a/c)@)|. Write K for a generic constant not depending on c. From Lemma 4.4, 
kiesne, 7 8 'biare a8. and further since ab? = (Xb,)’a* we can take 
{((23'8,, p) ~ (a/0)bkze} > K (> 0) by taking c sufficiently small. Hence we 
obtain, for small c, PIN. > kse] $ K~PR3PE|U, p |? = O(c”), by Lemma 4.2, 
provided p = 2, completing the proof of (iii). 

(iv) We first observe that from definition n~'k,, > A(1 — e) where A < 1 or 
A = 1 according as a* > af*®/(EZ)* or a* < a0?/(EZ)*. For either case the 
proof of (iii) is the same. Observe that 


[N, < ky, ] = [83 , > (a/c)V, ,, for some k € {1,,..., Re.) | 
c [Una < (c/a)é, oe eee (a/c)6), for some k = (irenka l 


Therefore we have 


(3.18) P[N, < k] <P 





Ù {U, aS (c/a)d,. „(82 n ka - (az) 


mT 


In view of Lemma 4.4, ee the fact that (Ab,)? = (a0*/a*), we may take 
{(k2'8, 4,” — (a/c)0kz2} < —K, with K > 0 for sufficiently small c. Thus in 
(3.18) we obtain for small c 


(3.19) PLN. < ka.] < P| max |U, eS Keg), 
hk, <ksk), , 




Applying the maximal inequality to the martingale {U, ,} and Lemma 4.2 yields 
(3.20) = PLN, < kze] < K (ka)? (Rae)? = O( cP -277?), 
completing the proof of (iv). 


PROOF OF THEOREM 2.1. In view of (iii) and (iv) of Lemma 3.2, we have 
immediately (k’)~'N. > 1 a.s., by selecting y < 4 a priori and p (2 2) large 
enough. 

Furthermore $(k_c^0)^{-1}N_c = (n^{-1}N_c)(n^{-1}k_c^0)^{-1} \le (n^{-1}k_c^0)^{-1}$ a.s., and as $c \downarrow 0$,
$n^{-1}k_c^0 \to \lambda > 0$. Thus $(k_c^0)^{-1}N_c$ is a.s. bounded for sufficiently small c. It then
follows from the dominated convergence theorem that $E[(k_c^0)^{-1}N_c]^r \to 1$ for any
$r > 0$.


PROOF OF THEOREM 2.2. (i) In view of Lemma 4.4 we have that k~ 'V, p > by 
a.s. whenever n~'k —> à € (0,1]. Therefore from (2.10) and part (i) of Theorem 
2.1 the result obtains. 

(ii) Notice that 6? > 6°/b, as. We first show 
(3.21) L|(k°)'"(8,, 4 — 8)| > N(0, 67/0,). 


n, 


If a* < a0*( EZY ?, then k? = n and ||W, „|| = O(n~”). Then (3.21) follows from 
(3.2) and an application of the ordinary central limit theorem. If a* > a@?(EZ)~? 
then n™'k® > € (0,1) with à satisfying (2.8). Now (3.21) obtains once we 
establish 


(3.22) g |(k2) "PU, s] > 10, 8b). 


To this end we apply the martingale central limit theorem of MacLeish (1974, 
Corollary 3.8). From Lemma 4.6 and the arguments preceding (3.6) we get 


Re 
(3.23) (k?) E, E(U3?|@, 1) > 06, in probability. 

t=] 
Furthermore, for each £ > 0 and ņ > 0, E(U*?[|U,*,| > e(k2)'/’]) is bounded 
by (°k?) "ŽENU, |2+7) and, from the proof of Lemma 4.2 we have that for 
p 2l, (ko yD, E(\U,*,/?) is bounded in c. It now follows that 
(k?) ipa? E(U UF > (kVA +0 and so together with (3.23) we get 
(3.22). Finally Theorem 2.2 will be established once we prove E( 6. i 6 we) = 
o(c'/*) which follows from (3.25) and (3.26) below. 


PROOF OF THEOREM 2.3. The proof of the theorem follows along lines 
analogous to those of Theorem 1 of Gardiner and Susarla (1984) and therefore 
only an outline of the details is provided here. Notice that 


R*/R? -1 = age” E(8,, n — 8) - Elé,- 8) } 


(3.24) +a,c'/7{ EV, N EV, 4} 
=A, +A, (say), 




where ap, a, are constants independent of c. Since E(6, er 6)? = (ab?) e + 
o(c'/*) and uang the Hölder inequality we can establish Ay = 0(1) in (3.24) 
once we show 


(3.25) EX(4,. n, > 6, 4) [Ree < N, sS ky} T o(c'’*), 
(3.26) E((4,, x — aY EN, < k,,or N, > kael) = o(c'”), 


and a corresponding statement of (3.26) with 6 „n, replaced by Â, x». To establish 
(3.26) consider first E{(6, ne OYIN, < El. (The proof of the remaining part 
of (3.26) is entirely analogous. ) From (8.2), 


E{(4,, n- 9) TN. < kael} < E{( NZ Un, n) LN. < Reel} 


+E(W2? NLN, = kael) 


(3.27) 


where here and in the sequel, we have suppressed constants not depending 
on c. Now the first term on the r.h.s. of (3.27) is bounded by 
E'/P{maX p, <e sk (R Un, pY PCPIN, < Rae])!/7 with p—' + q7! = 1. On apply- 
ing the maximal inequality, Lemmata 3.2 and 4.2 we fnd that the term is of 
order O(c?’ +D) where h = 1{(8/4q) — 1 — ((s/2q) + 1)y} and s > 2. Taking 
y < | and (s/q) appropriately we have h > 0. This yields 


(3.28) E(( NZ 'U,, v) LN. < koe ]} = o(cl’). 


For the second term in (3.27), use Lemma 3.2 and the definition t, to drop the 
term [ôn , = 0] in W, , of (3.3). Then using the fact that W, ,||, = o(k7') for 
p > 1, the Hélder inequality and Lemma 3.2 we obtain E(W? y [N. < kaJ) = 
O(c!’ 2) whence in view of (3.28) the proof of (3.26) may be terminated. 

To show (3.25) we again use (3.2) and eral the two eu involving Un k 
and W,, separately. Then bound E({N7'U, y — (RD) Un, kY lke < Ne = 
k, D) by E(max,, siak Ue Ry U, w) Notice that for k? <k ka, 


(kU, i (R9) U, ye) < 2(k?) (Uy 7 Un, 42) 
(3.29) | 


+2{(ko) = hz} U2 ho 


For the term in (3.29) use the maximal inequality of the martingale 
((U, Unke): k? < k < k;,} and for the second term use Lemma 4.2. Then 
nce e can be arbitrarily chosen and (k°)~? ~ c we get 


2 
(3.30) B| max (AU, 4 = (#2) 'Un ae) =o(c'”), 
ko<ksk,, 


For the term involving W, , in (3.25) follow the same argument using the fact 
that E(W?,) = O(k~*). Then together with (3.30) this establishes (3.25). We are 
left with the term A, in (3.24). Once again consider separately the expectation 




E(V,.n, ~ Va, ke) restricted to each of the events [N, < kze}, [N, > kz], and 
[Ry < “N, < E.l For the first term use Lemma 3.2 with y < | to show that it is 
of order Or ce! 27)/12) and likewise the second term is of order ole 2), Finally for 
the last term we have a bound e O(c” '”*). This establishes A, = o(1) and so from 
(3.24) the theorem is proven. 


4. Auxiliary lemmata. We present here the proofs of several auxiliary
results utilized in the proofs of Section 3, some of which are of interest in
themselves. In particular, Lemma 4.1 gives a general martingale result and
Lemma 4.3 gives a moment bound on a useful functional of centered empiricals. If
$\xi_1, \ldots, \xi_n$ are n uniform (0,1) r.v.'s, $n\Gamma_n(t) = \sum_{i=1}^{n} [\xi_i \le t]$ will denote its empirical
d.f. Also if I is the identity function on (0,1) and $q(t) = \{t(1 - t)\}^{1/r}$,
$r > 2$, $t \in (0,1)$, we write $\rho(\Gamma_n, I) = \sup\{|\Gamma_n(t) - t|/q(t): 0 < t < 1\}$. $c_1, c_2, \ldots$
are constants independent of n and of any k in $\{1, \ldots, n\}$.


LEMMA 4.1. For any $n \ge 1$, $\{U_{n,k}, \mathcal{B}_{n,k}: 1 \le k \le n\}$ is a zero-mean
martingale.


LEMMA 4.2. For any $p \ge 2$ and n, $\|U_{n,k}\|_p \le c_1 k^{1/2}$.


Proor. By the theorem of Dharmadhikari, Fabian, and Jogdeo (1968) and 
Lemma 4.1, ||U,, allp < Cli U?) lp for p > 1 with c, depending only on p. 
Now an application of Hilder’ 8 inequality followed by c_-inequality obtains that 
the right-hand side of the above inequality is O(k?/*) provided that 
E[R7'L%_V,*?)] <M < œ with M independent of k and n. But this follows 
since 


(4.1) E[V,t?1B,,.1] = pfx? '(H(Z + x(n —i+1)"')/H(Z)}" at, 


where Z,,_,, has been abbreviated to Z and since the integrand can be dominated 
by the integrable function x?~ 'exp(—x/8). 


LEMMA 4.3. For each $p > 0$, $\|\rho(\Gamma_n, I)\|_p = O(n^{-1/2})$, provided $r > 2 \vee p$.


ProoF. We obtain the result by showing that 
A= n?/2E | {sup{|T,(t) —t\/q(t):0<t< 1147] = O(1). 


To get this result, we take 8 = 4 and q as defined earlier in Theorem 1 of 
Wellner (1977b). Then with Y, as in Wellner’s result, we have E[|Y,|?] < co 
for p <r. Now let p22 until otherwise stated. Now Wellner’s theorem 
obtains that A < c,E[|T,|?] where nT, = LY, with Y, = q7 '(é,X0 < &, < 4] — 
f§(1/G. — x)q,/.(x)) dx. Hence an application of Burkholder’s inequality fol- 
lowed by Jensen’s inequality (need p > 2 here) to E[|T,|?] shows that A < 
c,E[|Y,|?] < œ, completing the proof for p 2 2. For p<2, A<1+ 
pfe dA?~'Pisup{vn |T (t) — t{/q(t): 0 < t< 4} >A]dA and again by Wellner’s 
theorem, the last integral is at most c,( {“A?~? dA) E[Y/] < œ, completing the 
proof of the lemma. 


LEMMA 4.4. For any $p > 0$, $\|k^{-1}V_{n,k} - d_{n,k}\|_p = O(k^{-1/2})$ and $\|k^{-1}\delta_{n,k} -
\theta^{-1}d_{n,k}\|_p = O(k^{-1/2})$.


ProoFr. Since the second result follows from the first result and Lemma 4.2, 
we prove the first result only. From (2.11), we have 


n y n Z “ijy 
k- Vae Int = ($) ra =) xt nA ated sia 


=I +H (say). 
Define the function k on [0,1] by Rk(-) = JË OH(x) dx. By our assumptions, k 
is differentiable on [0,1) with k(t) < 8. So by the mean value theorem, 


Žir = 21,2 
f H-[" (Pa, H |p < 8184) ~ Pn allp 


0 

where £,, is the kth order statistic corresponding to a random sample size 
n from the uniform distribution on (0,1). By Lemma 2 of Wellner (1977a), 
Eck) — Pn, all p = O(R’/n), whence in view of (4.3) we have IT = O(k~'/*). To 
handle J of (4.2) note that with q as in Lemma 4.3, we have 


[ha = H)|< p(T, r) f alH). 


Since [F g(H) < oo, Lemma 4.3 and (4.4) yield that I = O(k7'/”) provided 
à = liminf k/n > 0. If A=0, there exists a subsequence {n,}, such that 
k,,/m, — 0, along which we show I = O(k~'/*). It then follows from the usual 
subsequence arguments and the first part of our proof, that this same order for I 
obtains for all sequences {k,}. Thus in the sequel to show I = O(k''”*) we 
assume n~'k > 0. Note that 


[O = BD < oln Dl an > 3I S700) + n D| M), 


where p*(T,, J) = sup{|I,(¢) — ¢|/ — t): 0 < t < 3} and p, as in Lemma 4.3. 
For the first term in (4.5), apply Lemma 4.3 together with the fact the P[é,,, > 4] 
has an exponential rate of convergence to zero. This yields 


(4.6) leg Eas Z) [Say > tlp = O(n 'R'7?). 

For the second term in (4.5) on noting that {(1 — t)“ (T(t) -— 0):0<ts 4} isa 
martingale yields ||p*(I,,, Dll, = O(n"). Treat ff H by triangulation using 
(4.3) and the fact that (n/k) fE */®H > (hO). We will get || fH], = 
O( k/n) and so in view of (4.5) and (4.6) the proof may be terminated. 


(4.2) 


(4.3) 








(4.4) 





(4.5) 





LEMMA 4.5. For $p > 0$, $\|k\delta_{n,k}^{-1}[\delta_{n,k} \ge 1]\|_p = O(1)$.
Proor. Write 
k 
E|(k-"S, ,) "(8,42 1]] = E (279) PIS, =i] 
j=l 


= V+) +} =74+T, 


J< ER J<ek j>ek 


(4.7) 




where e (> 0) will be selected later on. Whenever n~'k > à we have d, , > b, 
with b, defined in (2.5). Therefore, by the usual subsequence arguments we have 
liminfd, , = A(> 0) where A = inf{b,: A € [0,1]}. Now choose e, a priori such 
that ðe < A. Then for the sufficiently large n, P[8, , < ek] < P[|k-'S, k 
0° 'd,, p| > d] for a d> 0. The lemma now follows from (4.7) and Lemma 4.4 
upon observing that I < kPP[ô, p < ek] < k POR?) and II < e™. 


LEMMA 4.6. If k/n > € (0,1), then for p > 0 


k -] 
B! E BVM Bn, sar) >, Tp + D f(x) /h(x) VAC) dr. 


i=] 


ProoFr. Follows from arguments similar to those in Lemma 4.1 of Gardiner 
(1982) on first showing that 


p 
sup |E(V,*°|@, .-1) — PUp + D{H(Zu-0)/Al Zu) } | >r 0- 


lsisk 


Acknowledgments. The authors wish to thank the associate editor (espe- 
cially in connection with the proof of Lemma 4.3) and the referees for carefully 
reading the paper and for suggestions which led to an improved version of the 


paper. 


REFERENCES 


DHARMADHIKARI, S. W., FABIAN, V. and JOGDEO, K. (1968). Bounds on moments of martingales.
Ann. Math. Statist. 39 1719-1723.

GALAMBOS, J. (1978). The Asymptotic Theory of Extreme Order Statistics. Wiley, New York.

GARDINER, J. (1982). Local asymptotic normality of progressively censored likelihood ratio statistics
and applications. J. Multivariate Anal. 12 230-247.

GARDINER, J. C. and SUSARLA, V. (1984). Risk efficient estimation of the mean exponential survival
time under random censorship. Proc. Nat. Acad. Sci. 81 5906-5909.

GARDINER, J., SUSARLA, V. and VAN RYZIN, J. (1984). Time sequential estimation of the exponential
mean under random withdrawals. Technical Report No. B-39, Department of Statistics,
Columbia University.

HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer.
Statist. Assoc. 58 13-30.

MACLEISH, D. L. (1974). Dependent central limit theorems and invariance principles. Ann. Probab.
2 620-628.

SEN, P. K. (1980). On time-sequential point estimation of the mean of an exponential distribution.
Comm. Statist. A 9 27-38.

WELLNER, J. A. (1977a). A law of the iterated logarithm for functions of order statistics. Ann.
Statist. 5 481-494.

WELLNER, J. A. (1977b). A martingale inequality for empirical processes. Ann. Probab. 5 303-308.


J. C. GARDINER V. SUSARLA 
DEPARTMENT OF STATISTICS AND PROBABILITY DEPARTMENT OF MATHEMATICAL SCIENCES 
MICHIGAN STATE UNIVERSITY STATE UNIVERSITY OF NEW YORK 
EAST LANSING, MICHIGAN 48824 BINGHAMTON, NEW YORK 13901 
J. VAN RYZIN


DEPARTMENT OF STATISTICS 
COLUMBIA UNIVERSITY 
New York, N.Y. 10025 


The Annals of Statistics
1986, Vol. 14, No. 2, 619-637


EMPIRICAL PROCESSES ASSOCIATED WITH V-STATISTICS 
AND A CLASS OF ESTIMATORS UNDER RANDOM 
CENSORING 


By MICHAEL G. AKRITAS 
The Pennsylvania State University


A class of empirical processes associated with V-statistics (V-empirical
process) under random censoring, and a class of nonparametric estimators
based on the corresponding quantile process are defined. The V-empirical
process is the censored data analogue of the U-empirical process considered
by Silverman (1976, 1983). The class of estimators is the analogue of the class
of generalized L-statistics introduced by Serfling (1984) and it includes the
results of Sander (1975). The weak convergence of the V-empirical process and
the corresponding quantile process is obtained and, through that, the asymptotic
behavior of the estimators is studied. Linear bounds for the
Kaplan-Meier estimator near the origin are established. A number of examples
are given, including the generalization of the Hodges-Lehmann estimator
for estimating the treatment effect in the two-sample problem under random
censoring. A measure of spread, a procedure for estimation in the two-way
ANOVA model, and a modified version of the two-sample Hodges-Lehmann
estimator, all of which are new even in the uncensored case, are proposed.


1. Introduction. For each s, $s = 1, \ldots, k$, let $X^0_{s1}, \ldots, X^0_{sN_s}$ be a sample of
independent identically distributed observations with distribution function $F_s$
(iid $F_s$) and $Y_{s1}, \ldots, Y_{sN_s}$ be iid $G_s$. Assume that $X^0_{sj}$, $Y_{mi}$ are independent for
all $s, m = 1, \ldots, k$, $j = 1, \ldots, N_s$, $i = 1, \ldots, N_m$. For each $s = 1, \ldots, k$ we observe


(1.1) $X_{si} = \min(X^0_{si}, Y_{si})$ and $\delta_{si} = I[X_{si} = X^0_{si}]$, $i = 1, \ldots, N_s$.

Clearly $X_{s1}, \ldots, X_{sN_s}$ are iid $H_s$ where $(1 - H_s) = (1 - F_s)(1 - G_s)$. For $r_s \le
N_s$, $s = 1, \ldots, k$, let

(1.2) $h_{k,r}(x_{11}, \ldots, x_{1r_1}; \ldots; x_{k1}, \ldots, x_{kr_k})$


be a real valued kernel where $r = (r_1, \ldots, r_k)$. In this paper we deal with the
problem of estimating some functional of the distribution of $h_{k,r}$ under $F_1, \ldots, F_k$
(such as the median, or some other linear combination of its quantiles) when the
sample is of the form (1.1) (random censorship model).

In the uncensored case the problem of estimating the mean of $h_{k,r}$ was
initiated by Hoeffding (1948), who introduced the class of U-statistics and
triggered a long sequence of interesting research. See Serfling (1980) for a modern
treatment and references. However, the problem of estimating other functionals
of the distribution of $h_{k,r}$ did not receive any attention until very recently


Received November 1982; revised October 1985. 

AMS 1980 subject classifications. Primary 62G05, 62G30; secondary 62E20. 

Key words and phrases. V-statistics, empirical processes, random censoring, L-statistics,
Kaplan-Meier estimator, Hodges-Lehmann estimator.


620 M. G. AKRITAS 


(Serfling, 1984). This is even more surprising in view of the fact that the lack of 
robustness of averages was recognized a long time ago. 
The class of estimators to be studied includes such statistics as

$$\operatorname{med}\left\{ \frac{X_{1i_1} + \cdots + X_{1i_r}}{r} - \frac{X_{2j_1} + \cdots + X_{2j_r}}{r};\ i_1, \ldots, i_r = 1, \ldots, N_1,\ j_1, \ldots, j_r = 1, \ldots, N_2 \right\}$$

(and its censored data analogue), which can be thought of as a Hodges-Lehmann
type statistic for estimating the shift in the two-sample case. A similar extension
of the one-sample Hodges-Lehmann estimator, namely

$$\operatorname{med}\left\{ \frac{X_{i_1} + \cdots + X_{i_r}}{r};\ i_1, \ldots, i_r = 1, \ldots, N \right\},$$

was considered (in the uncensored case) by Serfling and Thornton
(1982): it was found that using r = 3 (instead of the usual r = 2) increases the
asymptotic relative efficiency from 0.95 to 0.98, while r = 4 increases it to 0.99.
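For uncensored data the one-sample statistic above is easy to compute by brute force. The following sketch is illustrative only: it takes the median over averages of r distinct observations (unordered index sets, which give the same median as ordered ones), and the function name `generalized_hl_location` is an assumption introduced here.

```python
from itertools import combinations
from statistics import median


def generalized_hl_location(x, r=2):
    """Median of the averages of all subsets of r distinct observations.

    With r = 2 this is a Hodges-Lehmann type location estimate based on
    pairwise averages of distinct observations; r = 1 gives the sample median.
    """
    if not 1 <= r <= len(x):
        raise ValueError("r must satisfy 1 <= r <= len(x)")
    return median(sum(c) / r for c in combinations(x, r))
```

The number of subsets grows like $N^r/r!$, so this direct enumeration is only practical for small samples or small r.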

Consider for simplicity the case k = 1. In the case of uncensored observations,
the problem of estimating the distribution of $h_{k,r}$ has been treated by constructing
the empirical distribution function corresponding to the set of
$N(N - 1)\cdots(N - r + 1)$ random variables $h(X_{i_1}, \ldots, X_{i_r})$ obtained by every
possible choice of ordered sets of r distinct integers drawn from $1, \ldots, N$. Such
empirical processes were considered by Silverman (1976, 1983). See also Serfling
and Thornton (1982). The problem of estimating functionals of the distribution of
$h_{k,r}$ was initiated by Serfling (1984), who defined a class of generalized L-, M-,
and R-statistics essentially by placing the above mentioned empirical distribution
function into the functional form of the usual L-, M-, and R-statistics.

In this paper we consider the random censoring model and deal with the
problem of estimating the distribution of $h_{k,r}$, as well as functionals thereof, thus
extending the results of Serfling (1984) and Silverman (1976, 1983) in this case.
For reasons that will become apparent, we consider instead the empirical distri- 
bution function corresponding to V-statistics [cf. Serfling (1980), page 174]. 
Section 2 presents, for illustrative purposes, the generalization of the 
Hodges—Lehmann estimator for estimating the treatment effect in the two 
sample problem under random censoring. In Section 3 we present the empirical 
distribution function corresponding to the general kernel (1.2), establish its weak 
convergence to a Gaussian process and do similarly for the corresponding quantile 
process. The proof uses a Skorokhod construction; in the absence of censoring the 
results we obtain are identical with that of Silverman (1976). Incidentally, we 
show that in the uncensored case the V-process is asymptotically equivalent to 
the U-process so that our method provides a simpler proof for the weak conver- 
gence of the process considered by Silverman. In Section 4 we follow Serfling 
(1984) in defining generalized L-estimators. This does not only extend Serfling’s 
results to the case of censored survival data, it also generalizes the results of 


ESTIMATION UNDER RANDOM CENSORING 621 


Sander (1975). Serfling’s approach of differentiable statistical functionals could 
also be applied in our case; however, we chose to present a proof adapted from 
Shorack (1972). A number of results concerning the behavior of the ratio of the 
V-empirical process to the true distribution of the kernel that are required for 
such a proof are formulated and proved in the appendix. In particular, we 
establish linear bounds for the Kaplan—Meier estimator near the origin. A 
number of examples including a modified version of the two-sample Hodges— 
Lehmann estimator, a measure of spread, and a procedure of estimation in 
two-way ANOVA, all of which are new even in the uncensored case, are presented 
in Section 5. 


2. The Hodges—Lehmann estimator. In the notation of Section 1, let 
k = 2 and consider the kernel 


(2.1) Aoi z h(x; x) = Xi T Xo. 


In the uncensored case the Hodges-Lehmann estimator for estimating the 
treatment effect in the two sample problem is the median of the uniform 
probability measure that assigns mass N, ‘N, ' to each of the points h( X,,; X3,), 
t=1,...,N, J=1,..., Np. Noting that N~}, s = 1,2 is the mass assigned to 
each X,., Paleng N,, by “the corresponding empirical distribution function, we 
conclude that an appropriate analogue of the Hodges—Lehmann estimator in the 
presence of censoring is the median of the probability measure that assigns mass 
[Â (X) - F(X, - Ne [F(X2,) - f(X,- )] to each of the points A(X; Xe ,)- 
Here f = Ê No 87= 1,2, is the Kaplan-Meier estimator corresponding to the 
sth sample (Kaplan and Meier, 1958). Thus, the above weights are nonzero only 
if both X,, and X,, are uncensored observations. 
Formally, let $N = N_1 + N_2$, let $h(x_1; x_2)$ be as in (2.1), and set


(2.2) $V_N(t) = \int\!\!\int I[h(x_1; x_2) \le t] \, d\hat F_1(x_1) \, d\hat F_2(x_2), \quad t \in (-\infty, \infty).$


Thus, $V_N(t)$ is, for each $t$, a V-statistic with kernel $I[h(x_1; x_2) \le t]$. Further let

(2.3) $V_N^{-1}(p) = \inf\{t: V_N(t) \ge p\}, \quad 0 < p < 1.$


Then the generalized Hodges-Lehmann estimator defined above is given by
$V_N^{-1}(0.5)$. In the absence of censoring, this is the usual Hodges-Lehmann
estimator. Thus, the generalized Hodges-Lehmann estimator belongs in the class of
statistics considered in Section 4 where its asymptotic distribution is obtained.
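
The construction in (2.2) and (2.3) can be sketched numerically. The following is a minimal illustration, not the author's implementation: it assumes distinct observation times (no ties), uses the hypothetical helper names `km_jumps`, `weighted_quantile` and `generalized_hodges_lehmann`, and takes the kernel $x_1 - x_2$ of (2.1). Each pair of uncensored observations receives the product of the two Kaplan-Meier jumps, and the estimate is the weighted quantile $\inf\{t: V_N(t) \ge p\}$.

```python
import numpy as np

def km_jumps(times, events):
    """Kaplan-Meier jump sizes, returned in the order of the input arrays.

    times  : observed times (assumed distinct, i.e. no ties)
    events : 1 if the observation is uncensored, 0 if censored
    Censored observations receive mass 0.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    n = len(times)
    order = np.argsort(times)
    jumps = np.zeros(n)
    surv = 1.0  # value of 1 - F_hat just before the current time
    for rank, idx in enumerate(order):
        at_risk = n - rank
        if events[idx] == 1:
            jumps[idx] = surv / at_risk
            surv -= jumps[idx]
    return jumps

def weighted_quantile(values, weights, p):
    """inf{t : total weight of values <= t is >= p}, as in (2.3)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    # first index whose cumulative weight reaches p (small tolerance for rounding)
    pos = np.searchsorted(cum, p - 1e-12)
    if pos >= len(values):
        raise ValueError("p exceeds the total mass V_N(infinity)")
    return values[order][pos]

def generalized_hodges_lehmann(x1, d1, x2, d2, p=0.5):
    """Weighted median of X_1i - X_2j with product Kaplan-Meier weights."""
    w1 = km_jumps(x1, d1)
    w2 = km_jumps(x2, d2)
    diffs = np.subtract.outer(np.asarray(x1, float), np.asarray(x2, float)).ravel()
    weights = np.outer(w1, w2).ravel()
    keep = weights > 0  # only pairs of uncensored observations carry mass
    return weighted_quantile(diffs[keep], weights[keep], p)
```

With all indicators equal to 1 the weights reduce to $N_1^{-1} N_2^{-1}$, so the function returns the (lower) median of the pairwise differences, in line with the remark that the uncensored case gives the usual estimator.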

Before concluding this section we give a proposition which shows that the
above generalization of the Hodges-Lehmann estimator is a reasonable one.
First, recall that Efron's generalization of the Mann-Whitney-Wilcoxon statistic
is given by $\int (1 - \hat F_1) \, d\hat F_2$ (Efron, 1967).


PROPOSITION 2.1. Let $V_N^{-1}(0.5)$ be the generalized Hodges-Lehmann estimator
as defined by (2.2) and (2.3). Then $V_N^{-1}(0.5)$ is given by


(2.4) $\inf\{t: \int \hat F_1(x + t) \, d\hat F_2(x) \ge 0.5\}.$


622 M G. AKRITAS 


PROOF. Clearly the $t$ that satisfies (2.4) is the median of the convolution of
$X_1$ and $-X_2$ when the distribution of $X_1$ is $\hat F_1$ and that of $X_2$ is $\hat F_2$. But this is
just $V_N^{-1}(0.5)$.

Thus $V_N^{-1}(0.5)$ is obtained from Efron's statistic the same way that the usual
Hodges-Lehmann estimator is obtained from the Mann-Whitney-Wilcoxon
statistic (Hodges and Lehmann, 1963). Clearly we can obtain other estimators for
the treatment effect in the two-sample problem under random censoring by
inverting appropriate generalizations of other rank statistics. □


REMARK 2.1. Padgett and Wei (1982) derived (from a different context,
motivation, and method) the same generalization of the Hodges-Lehmann estimator,
but they only proved a consistency result. Also Wei and Gail (1983) considered
inversion of a class of two-sample rank tests in order to obtain rank estimates of
the scale ratio; since their method was tailored out of Hodges and Lehmann
(1963), the entire class of their estimators was called generalized Hodges-Lehmann
estimators. The results of these authors, however, were derived under
the additional assumption that the censoring variable in the second sample has
undergone the same scale transformation as the "survival" time and thus their
applicability may be limited.


3. The V-empirical process. Now let $h_{r_1,\ldots,r_k}$ be the general kernel of
relationship (1.2) and consider the problem of estimating its distribution under the
random censoring model, that is, when the data are of the form (1.1). In the case
of uncensored observations the problem has been treated by constructing an
empirical distribution function associated with U-statistics corresponding to
kernels $I[h_{r_1,\ldots,r_k} \le t]$, $-\infty < t < \infty$. With censored data, however, it is
computationally more convenient to consider an empirical distribution function associated
with V-statistics. To see why, consider the special case $k = 1$, $r = 2$; then the
U-statistic in the uncensored case assigns weight $N^{-1}(N - 1)^{-1}$ to each point
$I[h_2(X_{1i}, X_{1j}) \le t]$, $i \ne j$. This is the weight that the empirical distribution
function corresponding to the whole sample $X_{11}, \ldots, X_{1N}$ assigns to $X_{1i}$ times
the weight that the empirical distribution function corresponding to
$X_{11}, \ldots, X_{1,i-1}, X_{1,i+1}, \ldots, X_{1N}$ assigns to $X_{1j}$. Thus the analogue of a U-statistic
for censored data would require, in the general case, computing several
Kaplan-Meier estimators.

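Consider now the general kernel as in (1.2), and set


(3.1) $N = \sum_{s=1}^{k} N_s, \quad \lambda_s = \lim (N_s/N),$

(3.2) $V_N(t) = \int_0^{T_1} \cdots \int_0^{T_k} I[h_{r_1,\ldots,r_k} \le t] \prod_{(s,i)} d\hat F_s(x_{si}),$

(3.3) $V(t) = \int_0^{T_1} \cdots \int_0^{T_k} I[h_{r_1,\ldots,r_k} \le t] \prod_{(s,i)} dF_s(x_{si}).$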




Here $\hat F_s = \hat F_{s,N_s}$, $s = 1, \ldots, k$, is the Kaplan-Meier estimator of $F_s$ (see notation
in Section 1), $\prod_{(s,i)}$ denotes the double product $\prod_{s=1}^{k} \prod_{i=1}^{r_s}$, and if we define

(3.4) $T_F = \sup\{t: F(t) < 1\}$, where $F$ is a distribution function,

then the numbers $T_1, \ldots, T_k$ are any numbers satisfying

(3.5) $T_s < \min(T_{F_s}, T_{G_s}), \quad s = 1, \ldots, k.$

We are going to study the weak convergence of the process

(3.6) $W_N(t) = N^{1/2}[V_N(t) - V(t)].$

In order to formulate the first theorem, we need additional notation. Set

(3.7) $g_{s,i}(x|t) = \int \cdots \int I[h_{r_1,\ldots,r_k} \le t] \prod_{(s_1,j) \ne (s,i)} dF_{s_1}(x_{s_1 j}), \quad x \ge 0, \; i = 1, \ldots, r_s, \; s = 1, \ldots, k,$

where the domain of integration with respect to $F_s$ is $[0, T_s]$.
REMARK 3.1. The $x$ that appears on the left-hand side of (3.7) corresponds to
the $(s,i)$th argument of $h_{r_1,\ldots,r_k}$. Thus in the absence of censoring,
$g_{s,i}(x|t) = P[h_{r_1,\ldots,r_k} \le t \mid X_{si} = x]$.


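Next set

(3.8) $g_s(x|t) = \sum_{i=1}^{r_s} g_{s,i}(x|t), \quad x \ge 0, \; s = 1, \ldots, k,$

and

(3.9) $\hat L_s(x) = \hat L_{s,N_s}(x) = N_s^{1/2}[\hat F_s(x) - F_s(x)], \quad s = 1, \ldots, k.$


It is then well known that there exists a version of $\hat L_s$ and a Gaussian process $L_s$
such that

$\|\hat L_s - L_s\|_0^{T_s} \to 0$ a.s., $\quad s = 1, \ldots, k,$

where $\|\cdot\|$ denotes the sup-norm. The process $L_s(x)$, $x \ge 0$, is equivalent in law
to

$B^0[K_s(x)] \, \dfrac{1 - F_s(x)}{1 - K_s(x)}, \quad x \ge 0,$

where $B^0$ is the Brownian bridge process on $[0, 1]$, and $K_s(x) = C_s(x)/(1 + C_s(x))$
with $C_s(x) = \int_0^x (1 - F_s)^{-2} (1 - G_s)^{-1} \, dF_s$.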




Finally set

(3.10) $W(t) = \sum_{s=1}^{k} \lambda_s^{-1/2} \int_0^{T_s} g_s(x|t) \, dL_s(x),$

where $T_s$ is defined by (3.4) and (3.5).


THEOREM 3.1. Assume that $\lambda_s$, as defined in (3.1), is positive for all
$s = 1, \ldots, k$, and let $g_s(\cdot|t)$ be of bounded variation in $[0, \infty)$ uniformly in
$t \in (-\infty, \infty)$ for each $s = 1, \ldots, k$ (see Proposition 3.2). Let $W_N(t)$ and $W(t)$ be
the processes defined by (3.6) and (3.10), respectively. Then there exists a version
of the process $W_N$ such that

$\|W_N(t) - W(t)\| \to 0$ as $N \to \infty$

almost surely, where $\|\cdot\|$ denotes sup-norm.


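PROOF. Write

$W_N(t) = N^{1/2} \int \cdots \int I[h_{r_1,\ldots,r_k} \le t] \Big[ \prod_{(s,i)} d\hat F_s(x_{si}) - \prod_{(s,i)} dF_s(x_{si}) \Big]$

and use the formula

$\prod_{i=1}^{N} a_i - \prod_{i=1}^{N} b_i = \sum_{k=1}^{N} (a_k - b_k) \prod_{i=k+1}^{N} [b_i + (a_i - b_i)] \prod_{i=1}^{k-1} b_i$

$\qquad = \sum_{k=1}^{N} (a_k - b_k) \prod_{i \ne k} b_i$ + terms of at least second order in $(a_i - b_i)$.

We get

$\prod_{(s,i)} d\hat F_s(x_{si}) - \prod_{(s,i)} dF_s(x_{si}) = \sum_{(s,i)} \prod_{(s_1,j) \ne (s,i)} dF_{s_1}(x_{s_1 j}) \, d[\hat F_s(x_{si}) - F_s(x_{si})]$

$\qquad$ + terms of at least second order in $d[\hat F_s(x_{si}) - F_s(x_{si})]$.

From this and a simple argument it follows that

(3.11) $W_N(t) = N^{1/2} \sum_{s=1}^{k} \int_0^{T_s} g_s(x|t) \, d[\hat F_s(x) - F_s(x)] + O_N(t),$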


where the process $O_N(\cdot)$ is such that $\|O_N\|_{-\infty}^{\infty} \to 0$ in probability. Using this and
Natanson (1961), page 232, we have

$\|W_N(t) - W(t)\| \le \|O_N(t)\|$

$\qquad + \sum_{s=1}^{k} \Big| (N/N_s)^{1/2} \hat L_s(T_s) - \lambda_s^{-1/2} L_s(T_s) \Big| \, \sup_t [g_s(T_s|t)]$

$\qquad + \sum_{s=1}^{k} \Big\| (N/N_s)^{1/2} \hat L_s - \lambda_s^{-1/2} L_s \Big\|_0^{T_s} \, \sup_t TV_{[0,T_s]}[g_s(\cdot|t)],$

where $TV_{[a,b]}$ denotes the total variation in $[a, b]$. So it suffices to show that
each of the terms above converges to zero a.s. This is true for the last term since
$TV_{[0,T_s]}[g_s(\cdot|t)] \le M < \infty$ for all $t$ by assumption; noting that, by (3.7) and
(3.8), $\sup_t [g_s(T_s|t)] \le r_s$, the second term is easily seen to converge to zero, and
this completes the proof of the theorem. □
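
The product-difference formula used in the proof above can be checked numerically. The sketch below is only an illustration with hypothetical names (`product_difference_terms`); it evaluates the exact telescoping terms $(a_k - b_k) \prod_{i>k} a_i \prod_{i<k} b_i$ and compares their sum with $\prod a_i - \prod b_i$, and it also evaluates the first-order part $\sum_k (a_k - b_k) \prod_{i \ne k} b_i$ that drives (3.11).

```python
import numpy as np

def product_difference_terms(a, b):
    """Exact telescoping terms whose sum equals prod(a) - prod(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    return [(a[k] - b[k]) * np.prod(a[k + 1:]) * np.prod(b[:k]) for k in range(n)]

def first_order_part(a, b):
    """Linear term: sum_k (a_k - b_k) * prod_{i != k} b_i."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    return sum((a[k] - b[k]) * np.prod(np.delete(b, k)) for k in range(n))

a = [0.50, 0.30, 0.20]
b = [0.48, 0.31, 0.21]
exact = np.prod(a) - np.prod(b)
telescoped = sum(product_difference_terms(a, b))
linear = first_order_part(a, b)
```

When `a` is close to `b`, the gap between `exact` and `linear` is of second order in the differences, which is the sense in which the remainder in the proof is negligible.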


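Next, in order to find the covariance $\rho(v, t)$ of the process $W(t)$, note that the
$k$ terms in (3.10) are independent, so that $\rho(v, t) = \sum_{s=1}^{k} \lambda_s^{-1} \rho_s(v, t)$, where
$\rho_s(v, t)$ is the covariance function of

$W_s(t) = \int_0^{T_s} g_s(x|t) \, dL_s(x) =_d \int_0^{T_s} g_s(x|t) \, d\Big\{ B^0[K_s(x)] \, \dfrac{1 - F_s(x)}{1 - K_s(x)} \Big\}.$

But $B^0(u) =_d B(u) - uB(1)$, $u \in [0, 1]$, where $B$ is the standard Brownian
motion. Thus,

(3.12) $W_s(t) = \int_0^{T_s} g_s(x|t) \, \dfrac{1 - F_s(x)}{1 - K_s(x)} \, dB[K_s(x)] + \int_0^{T_s} g_s(x|t) \, B[K_s(x)] \, d\Big[ \dfrac{1 - F_s(x)}{1 - K_s(x)} \Big]$

$\qquad - B(1) \int_0^{T_s} g_s(x|t) \, d\Big[ K_s(x) \, \dfrac{1 - F_s(x)}{1 - K_s(x)} \Big]$

$\qquad = A_{s1}(t) + A_{s2}(t) - A_{s3}(t)$, say.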


Direct computation gives 


COROLLARY 3.1. Let $K_s$, $s = 1, \ldots, k$, be defined in connection with (3.9) and
$A_{si}$, $i = 1, 2, 3$, $s = 1, \ldots, k$, be defined by (3.12). Then under the notation and
assumptions of Theorem 3.1, the process $W_N(t)$ converges weakly to a mean zero
Gaussian process with covariance function given by

(3.13) $\rho(v, t) = \sum_{s=1}^{k} \lambda_s^{-1} \rho_s(v, t),$

where

(3.14) $\rho_s(v, t) = EA_{s1}(v)A_{s1}(t) + EA_{s1}(v)A_{s2}(t) - EA_{s1}(v)A_{s3}(t)$

$\qquad + EA_{s2}(v)A_{s1}(t) + EA_{s2}(v)A_{s2}(t) - EA_{s2}(v)A_{s3}(t)$

$\qquad - EA_{s3}(v)A_{s1}(t) - EA_{s3}(v)A_{s2}(t) + EA_{s3}(v)A_{s3}(t)$

and

$EA_{s1}(v)A_{s1}(t) = \int_0^{T_s} g_s(x|v) g_s(x|t) D_s^2(x) \, dK_s(x),$

$EA_{s2}(v)A_{s2}(t) = \int_0^{T_s} \int_0^{T_s} g_s(x|v) g_s(y|t) [K_s(x) \wedge K_s(y)] \, dD_s(x) \, dD_s(y),$

$EA_{s3}(v)A_{s3}(t) = \int_0^{T_s} g_s(x|v) \, d[K_s(x) D_s(x)] \cdot \int_0^{T_s} g_s(x|t) \, d[K_s(x) D_s(x)],$

$EA_{s1}(v)A_{s2}(t) = \int_0^{T_s} g_s(x|t) \int_0^{x} g_s(y|v) D_s(y) \, dK_s(y) \, dD_s(x),$

$EA_{s1}(v)A_{s3}(t) = \int_0^{T_s} g_s(x|t) \, d[K_s(x) D_s(x)] \cdot \int_0^{T_s} g_s(x|v) D_s(x) \, dK_s(x),$

$EA_{s2}(v)A_{s3}(t) = \int_0^{T_s} g_s(x|v) K_s(x) \, dD_s(x) \cdot \int_0^{T_s} g_s(x|t) \, d[K_s(x) D_s(x)],$

where $D_s = (1 - F_s)/(1 - K_s)$ and $a \wedge b = \min(a, b)$.
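
As a consistency check on the quantities $K_s$ and $D_s$, the short computation below works out the uncensored case. It is a sketch under stated assumptions: $F_s$ is continuous with $F_s(0) = 0$, there is no censoring ($G_s \equiv 0$ on the range of integration), and $C_s(x) = \int_0^x (1 - F_s)^{-2} (1 - G_s)^{-1} \, dF_s$ is taken as the variance function entering $K_s = C_s/(1 + C_s)$.

```latex
\begin{align*}
C_s(x) &= \int_0^x \frac{dF_s(u)}{[1 - F_s(u)]^2}
        = \frac{1}{1 - F_s(x)} - 1
        = \frac{F_s(x)}{1 - F_s(x)}, \\
K_s(x) &= \frac{C_s(x)}{1 + C_s(x)} = F_s(x), \\
D_s(x) &= \frac{1 - F_s(x)}{1 - K_s(x)} = 1 .
\end{align*}
```

Under these assumptions $dD_s = 0$, so every term of (3.14) involving $A_{s2}$ vanishes and $\rho_s(v,t)$ reduces to $\int g_s(x|v) g_s(x|t) \, dF_s(x) - \int g_s(x|v) \, dF_s(x) \int g_s(x|t) \, dF_s(x)$.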


COROLLARY 3.2. Let $V_N^{-1}(p)$, $0 < p < 1$, be the empirical quantile process,
where $V_N^{-1}(p) = \inf\{t: V_N(t) \ge p\}$, let $V^{-1}(p)$ be similarly defined, and consider
the notation and assumptions of Corollary 3.1. Then

(a) $N^{1/2}[V \circ V_N^{-1}(p) - p]$, $0 < p < V_N(\infty)$, converges weakly to a mean zero
Gaussian process $Z(p)$ with covariance $\rho(V^{-1}(p), V^{-1}(q))$.

(b) $N^{1/2}[V_N^{-1}(p) - V^{-1}(p)]$, $0 < p < V_N(\infty)$, converges weakly to
$Z(p)/V'(V^{-1}(p))$ provided the derivative $V'$ of $V$ exists and is continuous on
$(0, V(\infty))$.


PROOF. The proof follows from Corollary 3.1 and the results of Vervaat
(1972). □

Note that in the uncensored case $D_s = 1$ so that, for $k = 1$ and for $T_1 = \infty$,
formula (3.14) reduces to the formula of Theorem B, Silverman (1976), or formula
(5) of Silverman (1983). This, however, does not yet constitute an alternative
proof of the weak convergence of the U-empirical process $G_N(t)$ considered by
Silverman. In order to obtain such a proof, set $N_1 = N$, $r_1 = r$ and note that
since $G_N(t)$ is, for each fixed $t$, the U-statistic with kernel $I[h_r \le t]$ we have

$V_N(t) = \dfrac{N_{(r)}}{N^r} \, G_N(t) + \dfrac{N^r - N_{(r)}}{N^r} \, \gamma_N(t).$

Here $V_N(t)$ is given by (3.2) with $h_r$ and no censoring, $\gamma_N(t)$ is the average of
all terms $I[h_r(X_{i_1}, \ldots, X_{i_r}) \le t]$ with at least one equality $i_a = i_b$, $a \ne b$,
and $N_{(r)} = N(N - 1) \cdots (N - r + 1)$. It follows that

$\|N^{1/2}(V_N - G_N)\| = N^{1/2} \, \dfrac{N^r - N_{(r)}}{N^r} \, \|\gamma_N - G_N\|$

$\qquad \le N^{1/2} \, \dfrac{N^r - N_{(r)}}{N^r} \, [\|\gamma_N\| + \|G_N\|]$

$\qquad \le 2 N^{1/2} \, \dfrac{N^r - N_{(r)}}{N^r} \to 0.$

Thus we have established


PROPOSITION 3.1. In the uncensored case the U-empirical process $G_N(t)$ is
asymptotically equivalent to the V-empirical process $V_N(t)$.
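
The decomposition of the V-statistic into a U-statistic part and a part with repeated indices can be verified directly for $r = 2$ and one uncensored sample. The sketch below is illustrative only; the function name `u_v_gamma` and the kernel $|x_1 - x_2|$ are assumptions made for the example. For $r = 2$ one has $N_{(2)} = N(N-1)$ and $N^2 - N_{(2)} = N$.

```python
import numpy as np

def u_v_gamma(x, t, kernel):
    """Return (U, V, gamma) for the indicator kernel I[kernel(x_i, x_j) <= t], r = 2."""
    n = len(x)
    ind = np.array([[float(kernel(x[i], x[j]) <= t) for j in range(n)]
                    for i in range(n)])
    v = ind.sum() / n**2                          # V-statistic: all n^2 index pairs
    diag = np.trace(ind)
    u = (ind.sum() - diag) / (n * (n - 1))        # U-statistic: pairs with i != j
    gamma = diag / n                              # average over pairs with i == j
    return u, v, gamma

rng = np.random.default_rng(1)
x = rng.normal(size=8)
n = len(x)
u, v, gamma = u_v_gamma(x, 0.7, lambda a, b: abs(a - b))
identity_rhs = (n * (n - 1) / n**2) * u + (n / n**2) * gamma
```

Since all three quantities lie in $[0, 1]$, the identity also gives $|V_N - G_N| \le 2/N$ for $r = 2$, which is the bound that makes the two processes asymptotically equivalent after scaling by $N^{1/2}$.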


The next result provides a sufficient condition for $g_s(\cdot|t)$ to be of bounded
variation in $[0, \infty)$ uniformly in $t$ (see Theorem 3.1). Assumption 3.1 below is also
used in the appendix. Consider

$h_{s,i}(x) = h(X_{11}, \ldots, X_{s,i-1}, x, X_{s,i+1}, \ldots, X_{k r_k})$

as a stochastic process in $x$.


ASSUMPTION 3.1. Almost surely $[P_s]$, where $P_s$ is the product of the distributions
of the remaining arguments of $h$, there exists a partition of $[0, \infty)$ such that the function
$h_{s,i}(y)$, $0 \le y \le \infty$, is monotonic within each interval of the partition for all
$i = 1, \ldots, r_s$, $s = 1, \ldots, k$. Moreover there exists a positive number $M_s < \infty$ such
that the number of intervals in each of the above partitions is $\le M_s$ almost
surely $[P_s]$, $i = 1, \ldots, r_s$, $s = 1, \ldots, k$.


PROPOSITION 3.2. Under Assumption 3.1, the function $g_s(\cdot|t)$ is of bounded
variation in $[0, \infty)$ uniformly in $t \in (-\infty, \infty)$ for each $s = 1, \ldots, k$.


PROOF. From (3.8) it follows that

(3.15) $TV_{[0,\infty)}[g_s(\cdot|t)] \le \sum_{i=1}^{r_s} TV_{[0,\infty)}[g_{s,i}(\cdot|t)].$

By definition,

$TV_{[0,a]}[g_{s,i}(\cdot|t)] = \sup \sum_j |g_{s,i}(y_j|t) - g_{s,i}(y_{j-1}|t)|$

$\qquad \le \sup \sum_j \int \cdots \int \big| I[h_{s,i}(y_j) \le t] - I[h_{s,i}(y_{j-1}) \le t] \big| \prod_{(s_1,i_1) \ne (s,i)} dF_{s_1}(x_{s_1 i_1}),$

where the supremum is taken over all partitions of $[0, a]$. But under Assumption
3.1, the number of times that the process $h_{s,i}(y)$, $0 \le y < \infty$, will cross the
number $t$ is $\le M_s$ almost surely $[P_s]$. Thus

$\sum_j \big| I[h_{s,i}(y_j) \le t] - I[h_{s,i}(y_{j-1}) \le t] \big| \le M_s$

for all partitions of $[0, a]$ almost surely $[P_s]$. Since this is true for all $a > 0$, the
result follows from (3.15). □


REMARK 3.2. If $T_{F_s} < T_{G_s}$, then the $s$th domain of integration in (3.2) can be
from 0 to $\hat T_s = \max\{X_{si}; i = 1, \ldots, N_s\}$ while in relation (3.3) it can be from 0 to
$T_s = T_{F_s}$. Indeed, in this case $(1 - K_s)/(1 - F_s)$ remains bounded away from zero
and thus Theorem 1.2 in Gill (1983) implies that $\|\hat L_s - L_s\|_0^{\hat T_s} \to 0$ a.s.; the rest of
the arguments in Theorem 3.1 follow with minor adjustments.


4. Generalized L-statistics. In this section we extend the notion of generalized
L-statistics, as introduced by Serfling (1984), to the case of censored
survival data. Recall that if $X_1, \ldots, X_N$ are i.i.d. $F$ and $F_N$ denotes the
corresponding empirical distribution function, an L-statistic $N^{-1} \sum_i C_{Ni} \, \tilde g(F_N^{-1}(i/N))$
may be written as


(4.1) $\int_0^1 J_N(s) \, \tilde g(F_N^{-1}(s)) \, ds,$


where $J_N(s) = C_{Ni}$ for $s \in ((i - 1)/N, i/N]$, $i = 1, \ldots, N$. This functional form
of an L-statistic lends itself to generalization. In particular, if we substitute the
empirical process considered by Silverman (1976, 1983) instead of $F_N$ in (4.1) we
obtain the class of generalized L-statistics considered by Serfling (1984). And if
we substitute the process $V_N$ considered in Section 3 we obtain the extension of
the class of generalized L-statistics to the case of censored survival data, which
will be the object of study in this section. But now $J_N(s)$ is not suitable as
defined above. Due to the fact that $V_N$ has jumps of random size, $J_N$ will have to
be replaced by a function $\hat J_N$, say, which is constant over random intervals. Also
$V_N(\infty)$ is not necessarily equal to one. Thus the statistic we will study is of the
form


(4.2) $T_N = \int_0^{V_N(\infty)} \hat J_N(s) \, \tilde g(V_N^{-1}(s)) \, ds.$

To illustrate this point further, consider for simplicity the case $k = 1$ (one
sample) and $h(x) = x$, so that $V_N = \hat F_N$, the Kaplan-Meier estimator. Due to the
fact that under random censoring we end up with a random number $\tilde N \le N$ of
uncensored observations, the construction of linear combinations of order statistics
consists, in addition to choosing the weights $C_{N1}, \ldots, C_{NN}$, in deciding what
weight corresponds to each uncensored observation. If we define the rank of the
uncensored observation $X_i$ as $N \hat F_N(X_i)$, then we may assign to $X_i$ the weight $C_{Nj}$
with $j = [N \hat F_N(X_i)]$ ($[\cdot]$ denotes integer part). Note that if $J_N(s) = C_{Ni}$ for
$s \in ((i - 1)/N, i/N]$, as before, then the above assignment of weights corresponds
to $\hat J_N(s) = J_N(\hat F_N(\hat F_N^{-1}(s)))$ in (4.2) with $V_N$ replaced by $\hat F_N$; since in the
uncensored case (i.e., when $\hat F_N$ is the usual empirical distribution function)
$J_N(s) = J_N(\hat F_N(\hat F_N^{-1}(s)))$ it follows that the above choice of weights is a
reasonable one. Some other assignments of weights are discussed in Lemma 4.1.
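
The one-sample case just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's procedure: observation times are distinct, the names `km_jumps` and `censored_l_statistic` are hypothetical, and the weight index is taken as $\lceil N \hat F_N(X_i) \rceil$ so that it matches the half-open intervals $((i-1)/N, i/N]$ on which $J_N$ is constant (it coincides with the integer part whenever $N \hat F_N(X_i)$ is an integer, as in the uncensored case). The statistic is then the sum, over uncensored observations, of weight times score times Kaplan-Meier jump.

```python
import math
import numpy as np

def km_jumps(times, events):
    """Kaplan-Meier jump sizes in input order (distinct times assumed)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    n = len(times)
    jumps = np.zeros(n)
    surv = 1.0
    for rank, idx in enumerate(np.argsort(times)):
        if events[idx] == 1:
            jumps[idx] = surv / (n - rank)  # n - rank observations still at risk
            surv -= jumps[idx]
    return jumps

def censored_l_statistic(times, events, c, score=lambda x: x):
    """Sum over uncensored X_i of c[j_i - 1] * score(X_i) * (KM jump at X_i),
    with j_i = ceil(N * F_hat(X_i)), i.e. J_N evaluated at F_hat(X_i)."""
    times = np.asarray(times, dtype=float)
    n = len(times)
    jumps = km_jumps(times, events)
    f_hat = 0.0
    total = 0.0
    for idx in np.argsort(times):
        if jumps[idx] > 0:
            f_hat += jumps[idx]
            # small tolerance guards against rounding when N * F_hat is an integer
            j = min(n, max(1, math.ceil(n * f_hat - 1e-9)))
            total += c[j - 1] * score(times[idx]) * jumps[idx]
    return total
```

For example, with no censoring and `c = [0, 1, 1, 1, 0]` on five observations the function returns $N^{-1} \sum_i C_{Ni} X_{(i)}$ with the extreme order statistics given zero weight, which is the classical L-statistic form (4.1).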

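The purpose of this section is to study the asymptotic distribution of $T_N$ given
by (4.2). This will be done by adapting the method of Shorack (1972) which
allows unbounded "scores." Let

(4.3) $\hat C_0 = V_N^{-1}(0+), \quad \hat C_1 = V_N^{-1}(V_N(\infty)), \quad C_0 = V^{-1}(0), \quad C_1 = V^{-1}(V(\infty))$

and for $\varepsilon > 0$, $\beta = \beta(\varepsilon) > 0$, set

(4.4) $\Omega_{N\varepsilon} = [\, V_N(t) \le \beta^{-1} V(t), \; -\infty < t < \infty; \quad V_N(t) \ge \beta V(t), \; \hat C_0 \le t < \infty;$
$\qquad 1 - V_N(t) \le \beta^{-1}[1 - V(t)], \; -\infty < t \le \hat C_1; \quad 1 - V_N(t) \ge \beta[1 - V(t)], \; -\infty < t \le \hat C_1 \,].$


PROPOSITION 4.1. Let Assumption 3.1 hold with $M_s = 1$ (see Remark A.1 in
the Appendix). Then for any $\varepsilon > 0$ there exists $\beta = \beta(\varepsilon) > 0$ such that

$P(\Omega_{N\varepsilon}) \ge 1 - \varepsilon$

holds for all $N$, where $\Omega_{N\varepsilon}$ is defined in (4.4).


PROOF. It follows directly from Theorem A.2 of the Appendix. □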


ASSUMPTION 4.1. The function $g = \tilde g \circ V^{-1}$ is of bounded variation on
$(\theta, 1 - \theta)$ for all $\theta > 0$.


For fixed numbers $b_1$, $b_2$ and $K > 0$ define a "scores bounding" function $SB$
by

(4.5) $SB(s) = K s^{-b_1} (1 - s)^{-b_2}$ for $0 < s < 1$

and for fixed $\delta > 0$ define

$D(s) = K s^{-1/2 + b_1 + \delta} (1 - s)^{-1/2 + b_2 + \delta}$ for $0 < s < 1$.


Further, let $J$ be a fixed measurable function on $(0, 1)$.


ASSUMPTION 4.2 (Boundedness). Assume $|g| \le D$, $|J| \le SB$ and for all $N$,
$|\hat J_N| \le SB$ almost surely.


ASSUMPTION 4.3 (Smoothness). Except on a set of $s$'s of $|g|$-measure zero, we
have both that $J$ is continuous at $s$ and $\hat J_N \to J$ almost surely uniformly in some
small neighborhood of $s$ as $N \to \infty$.


ASSUMPTION 4.4. The function $SB[V(t)]$ is $\tilde g$-integrable on $[C_0, C_1]$ for all
$s = 1, \ldots, k$.




Before going into the main result of this section we will provide a result that
helps check the assumption $|\hat J_N| \le SB$ a.s. (Assumption 4.2) assuming that we
are willing to accept a definition of $\hat J_N$ that depends on the choice of $b_1$, $b_2$. In
particular, if $\hat J_N$ is defined as $\hat J_N^{(i)}$, $i = 1, \ldots, 4$, depending on the choice of $b_1$, $b_2$
(see Lemma 4.1 below), then $|\hat J_N| \le SB$ almost surely holds provided that
$|J_N| \le SB$ holds. Let $h_r$, $r = (r_1, \ldots, r_k)$, be the kernel in question and let
$J_N(s) = C_{Ni}$ for $s \in ((i - 1)/M, i/M]$, where $M = N_1^{r_1} \cdots N_k^{r_k}$, be the choice of
weights that would have been used for the generalized L-statistic in the absence
of censoring.


LEMMA 4.1. Let $SB$ be given by (4.5) and assume that $|J_N| \le SB$ where $J_N$ is
given above. Then

(i) if $b_1 > 0$, $b_2 < 0$, $|\hat J_N^{(1)}| \le SB$, where $\hat J_N^{(1)}(s) = J_N(V_N(V_N^{-1}(s)))$;

(ii) if $b_1 < 0$, $b_2 > 0$, $|\hat J_N^{(2)}| \le SB$, where $\hat J_N^{(2)}(s) = J_N(V_{N-}(V_N^{-1}(s)))$, where
$V_{N-}$ denotes the left-continuous version of $V_N$;

(iii) if $b_1 > 0$, $b_2 > 0$, $|\hat J_N^{(3)}| \le SB$, where $\hat J_N^{(3)}(s) = J_N(V_N(V_N^{-1}(s)))$ for $s \in [0, S_0]$
and $\hat J_N^{(3)}(s) = J_N(V_{N-}(V_N^{-1}(s)))$ for $s \in (S_0, 1)$, where $S_0$ is the point at
which $SB$ attains its minimum;

(iv) if $b_1 < 0$, $b_2 < 0$, $|\hat J_N^{(4)}| \le SB$, where $\hat J_N^{(4)}(s) = J_N(V_{N-}(V_N^{-1}(s)))$ for $s \in [0, S_1]$
and $\hat J_N^{(4)}(s) = J_N(V_N(V_N^{-1}(s)))$ for $s \in (S_1, 1)$, where $S_1$ is the point at
which $SB$ attains its maximum.


PROOF. (i) This follows from $V_N(V_N^{-1}(s)) \ge (i - 1)/M$ when $s \in ((i - 1)/M, i/M]$
and the fact that $SB$ is, in this case, decreasing. (ii) This follows from
$V_{N-}(V_N^{-1}(s)) \le (i - 1)/M$ and the fact that $SB$ is, in this case, increasing. (iii)
and (iv) follow by combining (i) and (ii). □


REMARK 4.1. If $|J_N| \le SB$, with $b_1 < 0$, there does not exist another $SB^*$,
$b_1^* < 0$, such that $|\hat J_N| \le SB^*$ with $\hat J_N(s) = J_N(V_N(V_N^{-1}(s)))$.


THEOREM 4.1. Under Assumptions 4.1-4.4,

$N^{1/2}(T_N - \mu_N) \to -\int J(s) \, W(V^{-1}(s)) \, dg(s)$

in probability, as $N \to \infty$, where

$\mu_N = \int_0^{V(\infty)} \hat J_N(s) \, \tilde g(V^{-1}(s)) \, ds.$


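PROOF. Let

$\Psi_N(s) = -\int_s^{V_N(\infty)} \hat J_N(u) \, du$

and write

$T_N = -\tilde g(\hat C_0) \Psi_N(0) - \int_{\hat C_0}^{\hat C_1} \Psi_N(V_N(t)) \, d\tilde g(t),$

$\mu_N = \tilde g(\hat C_1) \Psi_N(V(\hat C_1)) - \tilde g(\hat C_0) \Psi_N(V(\hat C_0)) - \int_{\hat C_0}^{\hat C_1} \Psi_N(V(t)) \, d\tilde g(t) + \int_{[\hat C_0, \hat C_1]^c} \tilde g(t) \, d\Psi_N(V(t)),$

where $\hat C_0$, $\hat C_1$ are given in (4.3) and $[\hat C_0, \hat C_1]^c$ denotes the complement of $[\hat C_0, \hat C_1]$
with respect to $[C_0, C_1]$. Thus,

$S_N \equiv N^{1/2}(T_N - \mu_N) = -\int_{\hat C_0}^{\hat C_1} A_N^*(t) \, W_N(t) \, d\tilde g(t) - (\gamma_{N1} + \gamma_{N2} + \gamma_{N3}),$

where

$A_N^*(t) = \dfrac{\int_{V(t)}^{V_N(t)} \hat J_N(s) \, ds}{V_N(t) - V(t)},$

$\gamma_{N1} = N^{1/2} \tilde g(\hat C_0) [\Psi_N(0) - \Psi_N(V(\hat C_0))], \quad \gamma_{N2} = N^{1/2} \tilde g(\hat C_1) \Psi_N(V(\hat C_1)),$

and

$\gamma_{N3} = N^{1/2} \int_{[\hat C_0, \hat C_1]^c} \tilde g(t) \, d\Psi_N(V(t)).$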

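Now fix $\varepsilon > 0$ and let $\beta$ be as in Proposition 4.1. If $\chi_{N\varepsilon} = I_{\Omega_{N\varepsilon}}(\omega)$, Assumption 4.2
implies

$|\chi_{N\varepsilon} A_N^*(t)| \, I_{[\hat C_0, \hat C_1]}(t) \le \Big| \dfrac{\int_{V(t)}^{V_N(t)} SB(s) \, ds}{V_N(t) - V(t)} \Big| \, I_{[\hat C_0, \hat C_1]}(t) \le C \cdot SB(V(t)) \cdot I_{[\hat C_0, \hat C_1]}(t),$

where the constant $C$ depends on $\varepsilon$, $b_1$, $b_2$. Thus with

$S = -\int_{C_0}^{C_1} J(V(t)) \, W(t) \, d\tilde g(t)$

we have

(4.6) $\chi_{N\varepsilon} |S_N - S| \le \int_{C_0}^{C_1} \big| \chi_{N\varepsilon} A_N^*(t) W_N(t) I_{[\hat C_0, \hat C_1]}(t) - J(V(t)) W(t) \big| \, d|\tilde g|(t) + |\gamma_{N1} + \gamma_{N2} + \gamma_{N3}|.$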


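But

$\big| \chi_{N\varepsilon} A_N^*(t) W_N(t) I_{[\hat C_0, \hat C_1]}(t) - J(V(t)) W(t) \big| \le C \cdot SB(V(t)) |W_N(t)| I_{[\hat C_0, \hat C_1]}(t) + SB(V(t)) |W(t)|,$

and, for $N$ large,

$|W_N(t)| \le 2 \sum_{s=1}^{k} \lambda_s^{-1/2} \big[ |\hat L_s(T_s)| \, g_s(T_s|t) + TV_{[0,T_s]}[g_s(\cdot|t)] \, \|\hat L_s\|_0^{T_s} \big],$

$|W(t)| \le \sum_{s=1}^{k} \lambda_s^{-1/2} \big[ |L_s(T_s)| \, g_s(T_s|t) + TV_{[0,T_s]}[g_s(\cdot|t)] \, \|L_s\|_0^{T_s} \big],$

so that

$\big| \chi_{N\varepsilon} A_N^*(t) W_N(t) I_{[\hat C_0, \hat C_1]}(t) - J(V(t)) W(t) \big|$

$\qquad \le 2C \, SB(V(t)) \sum_{s=1}^{k} \lambda_s^{-1/2} \, TV_{[0,T_s]}[g_s(\cdot|t)] \big\{ \|\hat L_s\|_0^{T_s} I_{[\hat C_0, \hat C_1]}(t) + \|L_s\|_0^{T_s} \big\}$

$\qquad + 2C \sum_{s=1}^{k} \lambda_s^{-1/2} \, SB(V(t)) \, g_s(T_s|t) \big\{ |\hat L_s(T_s)| I_{[\hat C_0, \hat C_1]}(t) + |L_s(T_s)| \big\}.$

Thus, since $A_N^*(t) \to J(V(t))$ almost everywhere $|\tilde g|$ (Assumption 4.3) and
$\|W_N(t) - W(t)\| \to 0$, we may, by Assumption 4.4 and Proposition 3.2, apply
Pratt's dominated convergence theorem (Pratt, 1960) to conclude that, for each $\omega$,
the integral on the right-hand side of (4.6) converges to zero. That $\gamma_{N1}$, $\gamma_{N2}$, $\gamma_{N3}$
are asymptotically negligible may be shown as in Shorack (1972). Hence,

$\chi_{N\varepsilon} |S_N - S| \to 0,$

which implies that $S_N \to S$ in probability. □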


5. Some examples. The wide applicability of generalized L-statistics is 
demonstrated by the following examples. 


EXAMPLE 5.1. Simple L-statistics. For the kernel $h(x) = x$ ($k = 1$, $r_1 = 1$),
we obtain a version of the results of Sander (1975).


EXAMPLE 5.2. Hodges-Lehmann estimator. For the kernel given in (2.1),
$\tilde g(x) = x$, and $\hat J_N(s) = (a_1 - a_0)^{-1} I_{[a_0, a_1]}(s)$, where $a_0 = \inf\{p: V_N^{-1}(p) =
V_N^{-1}(0.5)\}$, $a_1 = \sup\{p: V_N^{-1}(p) = V_N^{-1}(0.5)\}$, relation (4.2) gives $V_N^{-1}(0.5)$. Note
that the asymptotic distribution of $V_N^{-1}(0.5)$ may also be obtained from Corollary
3.2.


EXAMPLE 5.3. Modified Hodges-Lehmann estimators. In the spirit of
Serfling and Thornton (1982) we may consider modifications of the Hodges-Lehmann
estimator for the shift in the two-sample problem corresponding to
kernels

$h_{r,r}(x_{11}, \ldots, x_{1r}; x_{21}, \ldots, x_{2r}) = \dfrac{x_{11} + \cdots + x_{1r}}{r} - \dfrac{x_{21} + \cdots + x_{2r}}{r}.$

Thus in the uncensored case and for samples $X_1, \ldots, X_{N_1}$, $Y_1, \ldots, Y_{N_2}$ one
estimates the shift by

$\mathrm{med} \Big\{ \dfrac{X_{i_1} + \cdots + X_{i_r}}{r} - \dfrac{Y_{j_1} + \cdots + Y_{j_r}}{r} \Big\}.$

For $r = 1$ we have the usual Hodges-Lehmann estimator.




EXAMPLE 5.4. A measure of spread. Bickel and Lehmann (1979) propose as
a measure of spread the quantity $\mathrm{med}|X_1 - X_2|$. We can extend this measure of
spread to the censored data case by taking $h(x_1, x_2) = |x_1 - x_2|$ ($k = 1$, $r_1 = 2$),
and $\tilde g$, $\hat J_N$ as in Example 5.2. Again its asymptotic distribution may also be
derived from Corollary 3.2. Moreover by Theorem 4.1 [or the corresponding result
of Serfling (1984) for the uncensored case] we may study, as a measure of spread,
any other linear combination of the quantiles of $V_N$.


EXAMPLE 5.5. Another measure of spread. It is well known that the sample
variance is the U-statistic corresponding to the kernel $h(x_1, x_2) =
\tfrac{1}{2}(x_1 - x_2)^2$ [cf. Serfling (1980), page 173]. In the spirit of the present paper we
may consider, as an alternative measure of spread, the quantity $\mathrm{med}\{\tfrac{1}{2}(X_1 - X_2)^2\}$.
Thus, with the above kernel ($k = 1$, $r_1 = 2$), and $\tilde g$, $\hat J_N$ as in Example 5.2,
Theorem 4.1 or Corollary 3.2 will give the asymptotic distribution of this
quantity. Again, by Theorem 4.1, we may study the asymptotic distribution of
any other linear combination of the quantiles of $V_N$.


EXAMPLE 5.6. A measure of association. It is easy to see that the sample
covariance is the U-statistic corresponding to the kernel $h(x_1, y_1; x_2, y_2) =
\tfrac{1}{2}(x_1 - x_2)(y_1 - y_2)$; again we may form the V-empirical process and consider
instead some combination of its quantiles. Here, however, it is the bivariate
Kaplan-Meier estimator that is required and until recently the available results
were inconclusive [see Campbell and Földes (1980)].


EXAMPLE 5.7. Two-way ANOVA. Consider for simplicity the noninteraction
model $X_{ijm} = \mu + \alpha_i + \beta_j + \varepsilon_{ijm}$, $m = 1, \ldots, N_{ij}$, $i = 1, \ldots, R$, $j = 1, \ldots, C$,
$\sum \alpha_i = 0$, $\sum \beta_j = 0$, and assume that $R$, $C$ remain fixed while $N_{ij}$ tend to $\infty$. Hall
(1982) proposes a method for estimating the parameters that fits our formulation.
For each choice of one observation per cell (there are $N_{11} \cdots N_{RC}$ such choices)
compute the average of the observations and take as an estimate $\hat\mu$ of $\mu$ the
median of these averages. Computing, for each choice of one observation per cell
again, the average of the $i$th row minus the total average we obtain, for
$i = 1, \ldots, R - 1$, an estimate $\hat\alpha_i$ of $\alpha_i$ by taking the median of these differences;
for $\hat\alpha_R$ take $-\sum_{i=1}^{R-1} \hat\alpha_i$. Estimates $\hat\beta_j$ of $\beta_j$, $j = 1, \ldots, C$, are obtained similarly.

With $k = RC$ it is easily seen that the estimators $\hat\mu$, $\hat\alpha_i$, $\hat\beta_j$ are the medians of the
V-empirical processes corresponding to kernels

$h^{\mu}_{1,\ldots,1} = k^{-1} \sum_{i=1}^{R} \sum_{j=1}^{C} x_{ij}, \quad h^{\alpha_i}_{1,\ldots,1} = C^{-1} \sum_{j=1}^{C} x_{ij} - h^{\mu}_{1,\ldots,1} \quad$ and $\quad h^{\beta_j}_{1,\ldots,1} = R^{-1} \sum_{i=1}^{R} x_{ij} - h^{\mu}_{1,\ldots,1},$

respectively. The extension of these estimators to the censored data case is, in the
spirit of the present paper, straightforward.
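
The uncensored version of the procedure in Example 5.7 can be sketched as follows. This is a minimal illustration, not code from Hall (1982) or from the paper: the function name `hall_estimates` and the nested-list layout `cells[i][j]` are assumptions, the enumeration of all one-observation-per-cell choices is only feasible for small designs, and `np.median` averages the two middle values when their number is even, whereas $V_N^{-1}(0.5)$ as defined by an infimum picks the lower one.

```python
import itertools
import numpy as np

def hall_estimates(cells):
    """Median-based estimates of mu, alpha_i, beta_j in the additive two-way model.

    cells[i][j] is a 1-D sequence with the observations in row i, column j.
    """
    R, C = len(cells), len(cells[0])
    flat = [np.asarray(cells[i][j], dtype=float) for i in range(R) for j in range(C)]
    grand, row_dev, col_dev = [], [], []
    # one observation per cell: N_11 * ... * N_RC choices in total
    for choice in itertools.product(*flat):
        table = np.array(choice).reshape(R, C)
        m = table.mean()
        grand.append(m)
        row_dev.append(table.mean(axis=1) - m)  # row average minus total average
        col_dev.append(table.mean(axis=0) - m)  # column average minus total average
    mu = float(np.median(grand))
    alpha = np.median(np.array(row_dev), axis=0)
    beta = np.median(np.array(col_dev), axis=0)
    # last effect from the side conditions sum(alpha) = 0, sum(beta) = 0
    alpha[-1] = -alpha[:-1].sum()
    beta[-1] = -beta[:-1].sum()
    return mu, alpha, beta
```

With noise-free data generated from $\mu = 10$, $\alpha = (1, -1)$ and $\beta = (2, 0, -2)$ the function recovers these values, since every choice of one observation per cell yields the same table.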




APPENDIX 


Linear bounds for the Kaplan-Meier estimator and for $V_N$. The purpose
of this appendix is to establish linear bounds for $V_N$ (Theorem A.2) which
are needed in the proof of Theorem 4.1. This, however, requires linear bounds for
the Kaplan-Meier estimator near the origin. This result (Theorem A.1) extends
to the censored data case the corresponding result of Shorack (1972) for the usual
empirical distribution function. Linear bounds for the upper tail of the
Kaplan-Meier estimator have been established in Gill (1980) but the corresponding
bounds near the origin remained an open problem (Gill, 1980, page 40).

It should be mentioned that the proof of Theorem A.1 is due to a referee; the
original proof of this result (Akritas, 1983) is based on a different argument that
yields a (much) lengthier proof. The statement and proof of Lemma A.1 below,
however, are contained in the original proof.

For the statement and proof of Lemma A.1 and Theorem A.1 we will let $\hat F$
denote the Kaplan-Meier estimator based on a sample $(X_1, \delta_1), \ldots, (X_n, \delta_n)$
generated from a "survival" distribution $F$ and a "censoring" distribution $G$; also
we will let $H_0(t) = \int_0^t (1 - G_-) \, dF$, $\hat H_1$ denote the empirical c.d.f. based on the
random number $m$ ($m = \sum \delta_i$) of observations from $H_1 = H_0/H_0(\infty)$, and
$\hat H_0 = (m/n) \hat H_1$.


LEMMA A.1. In the notation above we have


(A.1) $\hat H_0 \le \hat F \le \hat H_1$ almost surely.


PROOF. Let $S_{\max}$ denote the largest uncensored observation and $X_{(n)} =
\max\{X_1, \ldots, X_n\}$. We will first show the right inequality in (A.1).


CASE 1. $S_{\max} = X_{(n)}$. Note that both $\hat F$ and $\hat H_1$ assign positive mass only on
the uncensored observations and that the jumps of $\hat F$ are increasing (that is, if
$X_i < X_j$ are both uncensored, the mass that $\hat F$ assigns to $X_j$ is greater than or
equal to the mass it assigns to $X_i$). Next it is easy to see that the mass that $\hat F$
assigns to $S_{\min}$ (= smallest uncensored observation) is always less than or equal to
the mass that $\hat H_1$ assigns to $S_{\min}$. This means that $\hat F(X_i) = \hat H_1(X_i)$ can happen
only when $X_i = S_{\max}$ or when it happens that the smallest $n - m$ observations are
all censored and the $m$ largest are the uncensored observations, in which case
$\hat F = \hat H_1$. In all other cases $\hat F < \hat H_1$.


CASE 2. $S_{\max} < X_{(n)}$. Note first that $\hat F(S_{\max}) < 1 = \hat H_1(S_{\max})$. If we now relabel the
largest observation as uncensored and $S_{\max}$ as censored, the new Kaplan-Meier
estimator will assign the same mass as $\hat F$ to all uncensored $X$'s that are less than
$S_{\max}$. Thus by Case 1 $\hat F < \hat H_1$ on $[0, S_{\max})$ and thus the proof of the right inequality in
relation (A.1) is complete. The left inequality in (A.1) follows easily by noting
that $\hat H_1/\hat F$ is largest (when we interpret 0/0 as 0) at $S_{\min}$ and in particular when
$S_{\min}$ equals $\min\{X_1, \ldots, X_n\}$, in which case $\hat F = (m/n) \hat H_1$. This completes the proof
of the lemma. □
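
The two-sided bound of Lemma A.1 is easy to examine on simulated data. The sketch below is an illustration only: the function name `lemma_a1_quantities`, the exponential survival and censoring distributions, and the seed are assumptions made for the example, and distinct observation times are assumed. It evaluates the Kaplan-Meier estimator, the empirical c.d.f. of the uncensored observations, and the ratio $m/n$ at every observed time.

```python
import numpy as np

def lemma_a1_quantities(times, events):
    """Return (F_hat, H1_hat, m/n) evaluated at the sorted observation times."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    n = len(times)
    m = int(events.sum())          # number of uncensored observations (assumed > 0)
    surv = 1.0
    count = 0
    f_hat, h1_hat = [], []
    for rank, idx in enumerate(np.argsort(times)):
        if events[idx] == 1:
            surv *= 1.0 - 1.0 / (n - rank)   # product-limit update
            count += 1
        f_hat.append(1.0 - surv)
        h1_hat.append(count / m)
    return np.array(f_hat), np.array(h1_hat), m / n

rng = np.random.default_rng(0)
survival = rng.exponential(size=50)
censoring = rng.exponential(size=50)
obs = np.minimum(survival, censoring)
delta = (survival <= censoring).astype(int)
f_hat, h1_hat, ratio = lemma_a1_quantities(obs, delta)
lower_ok = bool(np.all(ratio * h1_hat <= f_hat + 1e-12))
upper_ok = bool(np.all(f_hat <= h1_hat + 1e-12))
```

Both flags should come out true for any sample without ties, since the Kaplan-Meier jumps are nondecreasing, each is at least $1/n$, and together they sum to at most one.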


THEOREM A.1. Given $\varepsilon > 0$ there exists $\beta = \beta(\varepsilon)$ so that

(A.2) $P\big[ \, \|\hat F / F\|_0^{S_{\max}} \ge \beta^{-1} \big] \le \varepsilon$

and

(A.3) $P\big[ \, \|F / \hat F\|_{S_{\min}}^{\infty} \ge \beta^{-1} \big] \le \varepsilon,$

where $S_{\max}$ ($S_{\min}$) is the largest (smallest) uncensored observation.


PROOF. From the definitions of $H_0$ and $H_1$ (given right before Lemma A.1)
we have

$[1 - G(t_0-)] F \le H_0$ on $[0, t_0]$, $\quad H_0(\infty) H_1 \le F$ on $[0, \infty)$.

Thus, using Lemma A.1 we have $\|\hat F/F\|_0^{S_{\max}} \le \|\hat H_1/(H_0(\infty) H_1)\|_0^{S_{\max}}$ which
(conditionally on $m$ and hence unconditionally too) is $O_p(1)$ uniformly in $n$, giving
relation (A.2). Similarly, considering only the interval $[0, t_0]$ which is easily
shown to be sufficient, $\|F/\hat F\|_{S_{\min}}^{t_0} \le \|(H_0/[1 - G(t_0-)])/\hat H_0\|_{S_{\min}}^{t_0} = O_p(1)$, giving
relation (A.3). □


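We are now ready to present the linear bounds for $V_N$.


THEOREM A.2. Let Assumption 3.1 hold with $M_s = 1$ (see the remark
following the proof). Then for any $\varepsilon > 0$ there exists a $\beta = \beta(\varepsilon) > 0$ such that

(A.4) $P[\, 1 - V_N \le \beta^{-1}(1 - V)$ on $(-\infty, \hat C_1] \,] \ge 1 - \varepsilon,$

(A.5) $P[\, 1 - V_N \ge \beta(1 - V)$ on $(-\infty, \hat C_1] \,] \ge 1 - \varepsilon,$

(A.6) $P[\, V_N \le \beta^{-1} V$ on $(-\infty, \infty) \,] \ge 1 - \varepsilon,$

and

(A.7) $P[\, V_N \ge \beta V$ on $[\hat C_0, \infty) \,] \ge 1 - \varepsilon$

hold for all $N$, where $\hat C_0$, $\hat C_1$ are given in (4.3).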


PROOF. We will show only relation (A.4); the other relations are established
similarly. In what follows $\prod^*$ will denote the product $\prod_{(s_1,j)}$ with $(s_1, j) \ne (s, i)$.
Note that if $V(\infty) < 1$ the result holds trivially so we will assume $V(\infty) = 1$
where $V$ is defined in (3.3) with $T_s = \min(T_{F_s}, T_{G_s}) = T_{F_s}$, $s = 1, \ldots, k$ (see
Remark 3.2); also we will set $\hat T_s$ to be the maximum of the uncensored observations.

We have

$\beta^{-1}(1 - V(t)) - (1 - V_N(t)) = \beta^{-1} \int_0^{T_1} \cdots \int_0^{T_k} (1 - I[h_{r_1,\ldots,r_k} \le t]) \prod_{(s,i)} dF_s(x_{si})$

$\qquad - \int_0^{\hat T_1} \cdots \int_0^{\hat T_k} (1 - I[h_{r_1,\ldots,r_k} \le t]) \prod_{(s,i)} d\hat F_s(x_{si}) - 1 + \prod_{(s,i)} \hat F_s(\hat T_s)$

$\qquad = \int_0^{\hat T_1} \cdots \int_0^{\hat T_k} (1 - I[h_{r_1,\ldots,r_k} \le t]) \Big[ \prod_{(s,i)} d\,\beta^{-d} F_s(x_{si}) - \prod_{(s,i)} d\hat F_s(x_{si}) \Big]$ + terms of order $(1 - F_s(\hat T_s))$,

where $d = [\sum_{s=1}^{k} r_s]^{-1}$. Thus, using the formula for the difference of products
given in the proof of Theorem 3.1 we get

(A.8) $\beta^{-1}(1 - V(t)) - (1 - V_N(t)) = \sum_{(s,i)} \int \cdots \int (1 - I[h_{r_1,\ldots,r_k} \le t]) \prod{}^{*} d\hat F_{s_1}(x_{s_1 j}) \, d[\beta^{-d} F_s(x_{si}) - \hat F_s(x_{si})]$

$\qquad$ + integrals involving $d[\beta^{-d} F_s(x_{si}) - \hat F_s(x_{si})]$ in at least second order

$\qquad$ + terms of order $(1 - F_s(\hat T_s))$.

But by Assumption 3.1 with $M_s = 1$, each of the terms in the sum on the
right-hand side of (A.8) will be either of the form

$\int \cdots \int [\beta^{-d} F_s(y) - \hat F_s(y)] \prod{}^{*} d\hat F_{s_1}(x_{s_1 j})$

or of the form

$\int \cdots \int \big\{ [\beta^{-d} F_s(\hat T_s) - \hat F_s(\hat T_s)] - [\beta^{-d} F_s(y) - \hat F_s(y)] \big\} \prod{}^{*} d\hat F_{s_1}(x_{s_1 j}),$

where $y$ depends on $x_{s_1 j}$, $(s_1, j) \ne (s, i)$, and on $t$. But Theorem A.1 and Theorem
3.2.1 in Gill (1980) imply that both forms of integrals above are positive with high
probability. Integrals involving $d[\beta^{-d} F_s(x_{si}) - \hat F_s(x_{si})]$ in at least second order
may also be shown to be positive with a high probability. Finally there are
negative terms of order $1 - F_s(\hat T_s)$; these, however, converge to zero at least as
fast as the positive terms and thus $\beta$ can be chosen so the whole expression is
positive with high probability. □


REMARK A.1. The requirement that in Assumption 3.1 $M_s = 1$ is somewhat
restrictive. It is clear from the proof of Theorem A.2 that without this requirement
one would have to use linear bounds for the Kaplan-Meier estimator
indexed by intervals. In the general case such results are not available (and
indeed not true) even for the usual empirical distribution function. However,
Theorem A.2 may be proved if instead of $M_s = 1$ one requires that there exists a
$\delta > 0$ such that all the intervals in each of the partitions described in Assumption
3.1 are of length greater than $\delta$ almost surely $[P_s]$, $s = 1, \ldots, k$.




REFERENCES 


AKRITAS, M. G. (1983). Linear bounds for the Kaplan-Meier estimator. Unpublished manuscript. 

BICKEL, P. J. and LEHMANN, E. L. (1979). Descriptive statistics for nonparametric models IV. 
Spread. In Contributions to Statistics (J. Hajek Memorial Volume) (J. Jurečková, ed.). 

CAMPBELL, G. and FÖLDES, A. (1980). Large-sample properties of nonparametric bivariate 
estimators with censored data. Technical Report 80-10, Dept. of Statistics, Purdue Univ. 

EFRON, B. (1967). The two sample problem with censored data. Proc. Fifth Berkeley Symp. Math. 
Statist. Prob. 4 831-853. Univ. California Press. 

GILL, R. D. (1980). Censoring and Stochastic Integrals. Mathematical Centre Tracts 124,
Mathematisch Centrum, Amsterdam.

GILL, R. D. (1983). Large sample behaviour of the product-limit estimator on the whole line. Ann.
Statist. 11 49-58.

HALL, W. J. (1982). Personal communication. 

HARDY, G. H. (1952). A Course of Pure Mathematics. 10th ed. Cambridge University Press. 

HODGES, J. L., JR. and LEHMANN, E. L. (1963). Estimates of location based on rank tests. Ann.
Math. Statist. 34 598-611.

HOEFFDING, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math.
Statist. 19 293-325.

KAPLAN, E. L. and MEIER, P. (1958). Nonparametric estimation from incomplete observations. J.
Amer. Statist. Assoc. 53 457-481.

NATANSON, I. P. (1961). Theory of Functions of a Real Variable 1. Ungar, New York. 

PADGETT, W. J. and WEI, L. J. (1982). Estimation of the ratio of scale parameters in the two sample
problem with arbitrary right censorship. Biometrika 69 252-256.

PRATT, J. W. (1960). On interchanging limits and integrals. Ann. Math. Statist. 31 74-77.

SANDER, J. M. (1975). Asymptotic normality of linear combinations of functions of order statistics 
with censored data. Technical Report 8, Division of Biostatistics, Stanford Univ. 

SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

SERFLING, R. J. (1984). Generalized L-, M- and R-statistics. Ann. Statist. 12 76-86.

SERFLING, R. J. and THORNTON, D. H. (1982). An extension of Bahadur's representation of sample
quantiles, with applications to versions of the Hodges-Lehmann location estimator.
Technical Report 352, Johns Hopkins Univ.

SHORACK, G. R. (1972). Functions of order statistics. Ann. Math. Statist. 43 412-427. 

SILVERMAN, B. W. (1976). Limit theorems for dissociated random variables. Adv. in Appl. Probab. 8
806-819.

SILVERMAN, B. W. (1983). Convergence of a class of empirical distribution functions of dependent
random variables. Ann. Probab. 11 745-751.

VERVAAT, W. (1972). Functional central limit theorems for processes with positive drift and their 
inverses. Z. Wahrsch. verw. Gebiete 23 245-253. 

WEI, L. J. and GAIL, M. H. (1983). Nonparametric estimation for a scale-change model with
censored observations. J. Amer. Statist. Assoc. 78 382-388.


DEPARTMENT OF STATISTICS 
PENNSYLVANIA STATE UNIVERSITY 
UNIVERSITY PARK, PENNSYLVANIA 16802 


The Annals of Statistics 
1986, Vol. 14, No. 2, 638-647


CONDITIONAL EMPIRICAL PROCESSES 


By WINFRIED STUTE 
University of Giessen 


We prove a Donsker-type invariance principle for a nearest-neighbor-type 
conditional empirical process. As an application we show asymptotic normal- 
ity of conditional quantiles and derive large-sample distribution-free tests and 
confidence bands for a conditional distribution function. 


1. Introduction and main results. Let $(X, Y)$ be a random vector in $\mathbb{R}^{1+d}$ 
with distribution function $H$. For real $Y$ (i.e., $d = 1$) with $E(|Y|) < \infty$ write 
$E(Y|X) = m \circ X$, with $m(x) = E(Y|X = x)$ denoting the regression function of $Y$ 
at $X = x$. Assume that $(X_1, Y_1), (X_2, Y_2), \ldots$ is a sequence of independent 
random vectors with the same distribution as $(X, Y)$. Much work has been devoted 
to the problem of (nonparametric) estimation of $m$ when only little information 
on $H$ is available. See Collomb (1981) for a survey. 

For a general $d$, replacing $Y$ by the indicator function $1_{\{Y \le y\}}$, $y \in \mathbb{R}^d$, we 
might apply the existing results for statistical inference about the conditional 
distribution function 

$m(y|x) = P(Y \le y \mid X = x), \quad (x, y) \in \mathbb{R}^{1+d},$ 

at a fixed point $y \in \mathbb{R}^d$. As in the case of unconditional distribution functions, 
such a result is insufficient for most purposes. For example, when dealing with 
smooth functionals of $m(\cdot|x)$, it is necessary to handle estimates 
$m_n(\cdot|x) = m_n(\cdot|x; X_1, Y_1, \ldots, X_n, Y_n)$ of $m(\cdot|x)$ as a function rather than through its value at a 
single point. In other words, it is desirable to study the distributional character 
of the process $\{m_n(y|x): y \in \mathbb{R}^d\}$. 

In Stute (1984b) we introduced a nearest-neighbor-type estimate of $m(x)$, 
which turned out to be asymptotically normal under minimal assumptions on $H$. 
To be explicit, let $d = 1$ and write, for $n \ge 1$, 

$F_n(x) = n^{-1} \sum_{i=1}^{n} 1_{\{X_i \le x\}}, \quad x \in \mathbb{R},$ 

the empirical distribution function of $X_1, \ldots, X_n$. Let $K$ be a smooth probability 
kernel with bounded support and put, for some bandwidth $a_n > 0$, 

$m_n(x_0) = (na_n)^{-1} \sum_{i=1}^{n} Y_i K\left(\frac{F_n(x_0) - F_n(X_i)}{a_n}\right).$ 

Under some mild growth conditions on $a_n$ ($\to 0$) it was shown that 

$(na_n)^{1/2}[m_n(x_0) - m(x_0)] \to N(0, \sigma^2)$ 


Received July 1984; revised March 1986. 

AMS 1980 subject classifications. Primary 60F17, 62J02; secondary 62G05, 62G10, 62G15. 

Key words and phrases. Conditional empirical distribution function, invariance principle, condi- 
tional quantiles. 




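in distribution, where 

$\sigma^2 = \mathrm{var}(Y|X = x_0) \int K^2(u)\,du.$ 

As indicated above, replacing $Y_i$ by $1_{(-\infty, y]} \circ Y_i$ we obtain the process 

$m_n(y|x_0) = (na_n)^{-1} \sum_{i=1}^{n} 1_{(-\infty, y]}(Y_i)\, K\left(\frac{F_n(x_0) - F_n(X_i)}{a_n}\right), \quad y \in \mathbb{R}^d,$ 

as an estimate of $m(\cdot|x_0)$. 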

The main result of this paper states that when viewed as a random element in 
a suitable (topological) space of functions, $(na_n)^{1/2}[m_n(\cdot|x_0) - m(\cdot|x_0)] \to B_0$ in 
distribution, where $B_0$ is a certain Gaussian process (depending on $x_0$). In other 
words, we prove a Donsker-type invariance principle for the conditional process 
$m_n(y|x_0)$, $y \in \mathbb{R}^d$. 

Observe that, since $K$ is a probability kernel, $m_n(y|x_0)$ is nonnegative and 
"nondecreasing" in $y$. It is not a proper distribution function, however, since in 
general the weights $K([F_n(x_0) - F_n(X_i)]/a_n)/(na_n)$, $1 \le i \le n$, do not sum up to 
one. In the next section, we shall propose a modification of $m_n$, which turns out 
to be a proper distribution function with the same asymptotic behavior as $m_n$. 

In the following write $Y = (Y^1, \ldots, Y^d)$, and denote with $G^j$, $1 \le j \le d$, the 
marginal distribution function of $Y^j$. Define $G: \mathbb{R}^d \to [0,1]^d$ by $G(y_1, \ldots, y_d) := 
(G^1(y_1), \ldots, G^d(y_d))$, and let $F$ be the distribution function of $X$. We then have 

$H(x, y_1, \ldots, y_d) = C(F(x), G^1(y_1), \ldots, G^d(y_d)),$ 

where $C$ is the copula function of $H$, a distribution function on $[0,1]^{1+d}$ with 
uniform marginals. Similarly, for the empirical distribution function $H_n$ of 
$(X_1, Y_1), \ldots, (X_n, Y_n)$ we may write 

$H_n(x, y_1, \ldots, y_d) = C_n(F(x), G^1(y_1), \ldots, G^d(y_d)),$ 

where $C_n$ is the empirical distribution function of an i.i.d. sequence with 
distribution function $C$. The possibility of obtaining $H$ and $H_n$ from the "uniform" 
processes $C$ and $C_n$ by means of the transformation $(F, G^1, \ldots, G^d) \in [0,1]^{1+d}$ is 
important for deriving statements about multivariate empirical processes on 
$\mathbb{R}^{1+d}$ from corresponding processes on $[0,1]^{1+d}$ with uniform marginals. 
Moreover, since the weights $K([F_n(x_0) - F_n(X_i)]/a_n)$ have the nice property of 
depending only on the order statistics and ranks of $X_1, \ldots, X_n$, we may write 

$m_n(y|x_0) = a_n^{-1} \int 1_{(-\infty, y]}(u)\, K\left(\frac{F_n(x_0) - F_n(x)}{a_n}\right) H_n(dx, du)$ 

$\qquad = a_n^{-1} \int 1_{[0, G(y)]}(u)\, K\left(\frac{\tilde F_n(F(x_0)) - \tilde F_n(x)}{a_n}\right) C_n(dx, du)$ 

$\qquad = \tilde m_n(G(y)|F(x_0)),$ 


where $\tilde F_n$ is the first marginal distribution of $C_n$, an empirical distribution 
pertaining to an i.i.d. sample with uniform distribution. Consequently, in order to 
derive distributional results for $m_n$, we may and do assume that $H$ has uniform 
marginals. 
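
The reduction to uniform marginals rests on the fact that the weights involve the $X$ sample only through ranks. The following minimal sketch (for $d = 1$) illustrates this numerically: a strictly increasing transformation of the $X$ scale leaves the estimate unchanged. The triweight kernel, the toy data and the function names are assumptions made for illustration only; they do not appear in the paper.

```python
import math

def triweight(u):
    """Twice continuously differentiable probability kernel on [-1, 1] (illustrative choice)."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0

def ecdf(sample, x):
    """Empirical distribution function F_n(x) = n^{-1} #{i : X_i <= x}."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def cond_edf(xs, ys, x0, y, a_n, kernel=triweight):
    """m_n(y|x0) = (n a_n)^{-1} sum_i 1{Y_i <= y} K([F_n(x0) - F_n(X_i)] / a_n)."""
    n = len(xs)
    f0 = ecdf(xs, x0)
    total = 0.0
    for xi, yi in zip(xs, ys):
        if yi <= y:
            total += kernel((f0 - ecdf(xs, xi)) / a_n)
    return total / (n * a_n)

# Hypothetical toy data, not taken from the paper.
xs = [0.3, 1.7, 0.9, 2.4, 1.1, 0.5, 2.0, 1.4]
ys = [1.0, 2.5, 1.2, 3.1, 0.7, 0.9, 2.8, 1.9]
x0, y, a_n = 1.2, 2.0, 0.4

original = cond_edf(xs, ys, x0, y, a_n)
# A strictly increasing map of the X scale leaves all ranks, hence all weights, unchanged.
transformed = cond_edf([math.exp(x) for x in xs], ys, math.exp(x0), y, a_n)
```

Taking the transformation to be $F$ itself gives the uniform case treated in the remainder of the paper.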

Throughout this paper assume that $K$ is a twice continuously differentiable 
probability kernel vanishing outside some finite interval. $(a_n)_n$ will be a sequence 
of bandwidths converging to zero at appropriate rates. 

While $K$ and $a_n$ are at the statistician's disposal, the invariance principle may 
be proved only under an additional smoothness assumption on the unknown $H$ 
(resp. $m$). Recall that for $m_n$ we may assume w.l.o.g. that $F$ is the uniform 
distribution on $[0,1]$. 


ASSUMPTION (A). Assume that 

$\sup_{\|t - s\| \le \delta} |m(t|x) - m(s|x)| = o\left((\ln \delta^{-1})^{-1}\right) \quad \text{as } \delta \to 0$ 

uniformly in a neighborhood of $x_0$. 


Clearly (A) is satisfied whenever $m$ is Hölder continuous of some positive 
order. No existence of densities is required. (A) also guarantees that $m$ is 
equicontinuous in a neighborhood of $x_0$. This is quite natural in view of the fact 
that the standardized process $m_n$ is expected to have a limit process with 
continuous sample paths. Now, for $y \in [0,1]^d$, put 

$\bar m_n(y|x_0) = a_n^{-1} \int 1_{[0, y]}(u)\, K\left(\frac{F(x_0) - F(x)}{a_n}\right) H(dx, du).$ 

Recall $F = U[0,1]$, the uniform distribution on $[0,1]$, and observe that, by 
definition of $m(\cdot|x)$, 

$\bar m_n(y|x_0) = a_n^{-1} \int_0^1 m(y|x)\, K\left(\frac{x_0 - x}{a_n}\right) dx,$ 

a smoothed version of $m(y|x_0)$. 

To state our first main result, we denote with $D[0,1]^d$ the space of all 
"right-continuous" functions on $[0,1]^d$ with "left-hand" limits; cf. Billingsley 
(1968) for $d = 1$ and Neuhaus (1971) for a general $d$. Endow $D[0,1]^d$ with the 
Skorokhod topology, and let $\mathcal{B}(D)$ be the generated Borel $\sigma$-field. Clearly, $m_n$ is 
a random element in $(D, \mathcal{B}(D))$, so its distribution is well-defined. 


THEOREM 1. Assume that $H$ has uniform marginals, and let $a_n \to 0$ be such 
that $na_n^3 \to \infty$. Under (A) we then have for Lebesgue-almost all $0 < x_0 < 1$ 

$(na_n)^{1/2}[m_n(\cdot|x_0) - \bar m_n(\cdot|x_0)] \to B_0 \equiv B_0(x_0)$ in distribution. 

Here $B_0$ is a centered Gaussian process on $[0,1]^d$ with continuous sample paths 
vanishing at the lower boundary of $[0,1]^d$ and covariance 

$\mathrm{cov}(B_0(y_1), B_0(y_2)) = [m(y_1 \wedge y_2|x_0) - m(y_1|x_0)\, m(y_2|x_0)] \int K^2(u)\,du.$ 




In other words, $B_0$ is a scaled tied-down Brownian sheet with intensity 
measure $m(\cdot|x_0)$. When $X$ is independent of $Y$, $m(\cdot|x_0) = Q_Y$, the distribution of 
$Y$, for all $x_0$. Hence up to a scaling factor, $B_0$ is equal to the limit of the 
unconditional empirical process pertaining to the $Y$ sequence, as should be 
expected. Observe, however, that the standardizing factor is $(na_n)^{1/2}$ with 
$a_n \to 0$, indicating a slower rate of convergence. This is the price one has to pay 
when making inference about conditional (local) quantities. 

It is not hard to prove that the standardized processes $m_n(\cdot|x)$ converge 
jointly in distribution to $B_0(x)$ even at finitely many points $x = x_1, \ldots, x_k$, with 
$B_0(x_1), \ldots, B_0(x_k)$ being independent. 

The corresponding invariance principle for $(na_n)^{1/2}[m_n(\cdot|x_0) - m(\cdot|x_0)]$ may 
be obtained under an additional smoothness condition on $m(y|x)$ as a function of 
$x$. This is necessary in order to guarantee that $\bar m_n - m \to 0$ at a satisfactory rate. 


ASSUMPTION (B). For each $y$, $m(y|\cdot)$ is twice continuously differentiable in 
a neighborhood $U$ of $x_0$, such that 

$\sup_{x \in U} \sup_{y} |m''(y|x)| < \infty.$ 
xEU y 


COROLLARY 2. Under the conditions of the theorem, assume that (B) holds, 
and let $K$ be such that $\int uK(u)\,du = 0$. Whenever $na_n^5 \to 0$ we have for 
Lebesgue-almost all $0 < x_0 < 1$ 

$(na_n)^{1/2}[m_n(\cdot|x_0) - m(\cdot|x_0)] \to B_0$ in distribution. 

PROOF. According to the theorem it remains to show that $(na_n)^{1/2}[\bar m_n(y|x_0) 
- m(y|x_0)] \to 0$ uniformly in $y$. Because of $na_n^5 \to 0$, it suffices to prove 
$\bar m_n - m = O(a_n^2)$. This follows, however, in much the same way as the corollary 
in Stute (1984b). □ 
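
The rate $O(a_n^2)$ can be made plausible by the usual Taylor argument for kernel smoothers. The display below is a sketch of that standard argument, not a reproduction of the proof in Stute (1984b); it assumes $n$ so large that the support of $K((x_0 - \cdot)/a_n)$ lies in $U \cap [0,1]$, and it writes $m'$, $m''$ for derivatives of $m(y|x)$ with respect to $x$ and $\xi_u$ for an intermediate point between $x_0 - a_n u$ and $x_0$.

```latex
\begin{aligned}
\bar m_n(y|x_0)
  &= a_n^{-1}\int_0^1 m(y|x)\,K\!\left(\frac{x_0-x}{a_n}\right)dx
   = \int m(y|x_0-a_nu)\,K(u)\,du \\
  &= m(y|x_0) - a_n\,m'(y|x_0)\int uK(u)\,du
     + \frac{a_n^2}{2}\int m''(y|\xi_u)\,u^2K(u)\,du ,\\[4pt]
\sup_y\bigl|\bar m_n(y|x_0)-m(y|x_0)\bigr|
  &\le \frac{a_n^2}{2}\,\sup_{x\in U}\sup_y|m''(y|x)|\int u^2K(u)\,du = O(a_n^2),\\[4pt]
(na_n)^{1/2}\sup_y\bigl|\bar m_n(y|x_0)-m(y|x_0)\bigr|
  &= O\bigl((na_n^5)^{1/2}\bigr)\to 0 .
\end{aligned}
```

The first-order term vanishes because $\int uK(u)\,du = 0$, which is why this condition appears in the corollary.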


With the same method of proof, one may also treat the optimal choice of a 
bandwidth, namely $na_n^5 \to c > 0$. For a general $c$, the limit process is equal to the 
noncentered Gaussian process 

$B_c: y \to B_0(y) + \frac{\sqrt{c}\, m''(y|x_0)}{2} \int u^2 K(u)\,du.$ 

Clearly, for $B_c$, $c > 0$, to be continuous, we also need continuity of $m''(\cdot|x_0)$. As 
for the usual empirical process, the invariance principle for the conditional 
empirical process may be used to test the hypothesis $m(\cdot|x_0) = m_0(\cdot|x_0)$ and to 
determine confidence bands for $m(\cdot|x_0)$. For example, when $d = 1$ and $G$ is 
continuous, we have (when $c = 0$) 

$(na_n)^{1/2} \sup_{y \in \mathbb{R}} |m_n(y|x_0) - m(y|x_0)|$ 

$\qquad = (na_n)^{1/2} \sup_{0 \le u \le 1} |\tilde m_n(u|F(x_0)) - \tilde m(u|F(x_0))|$ 

$\qquad \to \sup_{0 \le u \le 1} |B_0(u|F(x_0))|$ in distribution. 




Here $\tilde m_n$ and $\tilde m$ are the processes pertaining to the "uniform" $C_n$ and $C$. 
Observe, however, that for continuous $\tilde m(\cdot|F(x_0))$ 

$\sup_{0 \le u \le 1} |B_0(u|F(x_0))| = \sqrt{\int K^2(u)\,du}\; \sup_{0 \le u \le 1} |B^0(u)|$ 

in distribution, where $B^0$ is a standard Brownian bridge on $[0,1]$. As a 
consequence, we see that the Kolmogorov-Smirnov test statistic leads to large-sample 
distribution-free tests and confidence bands for $m(\cdot|x_0)$. 
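
As a rough illustration of how such a band could be computed in practice (for $d = 1$, continuous $G$ and $c = 0$), the sketch below obtains the critical value of $\sup|B^0|$ from the classical Kolmogorov series and scales it by $\sqrt{\int K^2/(na_n)}$. The function names, the sample size, the bandwidth and the value $350/429$ (the integral of the squared triweight kernel) are illustrative assumptions, not taken from the paper.

```python
import math

def kolmogorov_cdf(x, terms=100):
    """P(sup_u |B0(u)| <= x) for a standard Brownian bridge, via the Kolmogorov series."""
    if x <= 0.0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def kolmogorov_quantile(level, lo=0.2, hi=5.0, tol=1e-10):
    """Solve kolmogorov_cdf(c) = level by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_cdf(mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def band_half_width(n, a_n, kernel_sq_integral, level=0.95):
    """Asymptotic half-width c * sqrt(int K^2 / (n a_n)) of a band around m_n(.|x0)."""
    c = kolmogorov_quantile(level)
    return c * math.sqrt(kernel_sq_integral / (n * a_n))

c95 = kolmogorov_quantile(0.95)
# Hypothetical n and a_n; 350/429 is the integral of K^2 for the triweight kernel.
half_width = band_half_width(n=2000, a_n=0.1, kernel_sq_integral=350.0 / 429.0)
```

The band $m_n(y|x_0) \pm$ `half_width`, $y \in \mathbb{R}$, then has asymptotic coverage equal to the chosen level under the conditions of Corollary 2.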


2. A proper conditional empirical process. As mentioned earlier, $m_n$ is 
not a proper distribution function. Alternatively, we might consider the function 

$m_n^*(y|x_0) = \frac{\sum_{i=1}^{n} 1_{\{Y_i \le y\}}\, K\left(\frac{F_n(x_0) - F_n(X_i)}{a_n}\right)}{\sum_{i=1}^{n} K\left(\frac{F_n(x_0) - F_n(X_i)}{a_n}\right)},$ 

a proper distribution function. Observe that 

$m_n^*(y|x_0) = m_n(y|x_0)/f_n(x_0),$ 

where 

$f_n(x_0) = (na_n)^{-1} \sum_{i=1}^{n} K\left(\frac{F_n(x_0) - F_n(X_i)}{a_n}\right).$ 

In other words, $f_n(x_0) = m_n(x_0)$ with $Y_i \equiv 1$. Since for such a $Y$ one has 
$m(x) = 1$ and $\mathrm{var}(Y|X = x) = 0$ we obtain that 

$(na_n)^{1/2}[f_n(x_0) - 1] \to 0$ in probability. 

It follows that under the smoothness assumptions of the theorem 

$(na_n)^{1/2}[m_n^*(y|x_0) - \bar m_n(y|x_0)] = (na_n)^{1/2}[m_n(y|x_0) - \bar m_n(y|x_0)] + o_P(1)$ uniformly in $y$. 

We thus see that $m_n^*$ fulfills the same invariance principle as $m_n$. 
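
The relation between $m_n$, $f_n$ and $m_n^*$ can be checked on a small sample. The sketch below (for $d = 1$) computes all three from the same rank-based weights; the triweight kernel, the toy data and the helper names are assumptions made for illustration.

```python
def triweight(u):
    """Twice continuously differentiable probability kernel on [-1, 1] (illustrative choice)."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0

def nn_weights(xs, x0, a_n, kernel=triweight):
    """Unnormalized weights K([F_n(x0) - F_n(X_i)] / a_n), i = 1, ..., n."""
    n = len(xs)
    def ecdf(x):
        return sum(1 for xi in xs if xi <= x) / n
    f0 = ecdf(x0)
    return [kernel((f0 - ecdf(xi)) / a_n) for xi in xs]

def m_n(xs, ys, x0, y, a_n):
    """m_n(y|x0): weights divided by n * a_n, so they need not sum to one."""
    w = nn_weights(xs, x0, a_n)
    return sum(wi for wi, yi in zip(w, ys) if yi <= y) / (len(xs) * a_n)

def f_n(xs, x0, a_n):
    """f_n(x0): the estimate m_n(x0) computed with Y_i identically 1."""
    return sum(nn_weights(xs, x0, a_n)) / (len(xs) * a_n)

def m_n_star(xs, ys, x0, y, a_n):
    """m_n^*(y|x0): self-normalized weights, hence a proper distribution function in y."""
    w = nn_weights(xs, x0, a_n)
    return sum(wi for wi, yi in zip(w, ys) if yi <= y) / sum(w)

# Hypothetical toy data, not taken from the paper.
xs = [0.3, 1.7, 0.9, 2.4, 1.1, 0.5, 2.0, 1.4]
ys = [1.0, 2.5, 1.2, 3.1, 0.7, 0.9, 2.8, 1.9]
x0, a_n = 1.2, 0.4

total_mass_raw = m_n(xs, ys, x0, max(ys), a_n)        # equals f_n(x0), generally not 1
total_mass_star = m_n_star(xs, ys, x0, max(ys), a_n)  # equals 1
```

Dividing by the sum of the weights instead of by $na_n$ is what makes $m_n^*$ reach one at the largest observation.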


3. Conditional quantiles. When $d = 1$, i.e., when $Y$ is real-valued, the 
process $m_n^*$ has an inverse or quantile function 

$m_n^{*-1}(u|x_0) = \inf\{y \in \mathbb{R}: m_n^*(y|x_0) \ge u\}, \quad 0 < u < 1.$ 

This serves as an estimate of the $u$-quantile of $m(\cdot|x_0)$. In this section we 
derive the limit distribution of 

$Q_n(u) = (na_n)^{1/2}[m_n^{*-1}(u|x_0) - m^{-1}(u|x_0)], \quad 0 < u < 1$ fixed. 

For such a $u$, write $y_u = m^{-1}(u|x_0)$. 
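
Computationally, the generalized inverse is a weighted sample quantile of the $Y_i$. The following sketch accumulates the normalized weights in increasing order of $Y_i$ and returns the first value at which the level $u$ is reached. Kernels, data and function names are hypothetical; the uniform kernel is included only because with $a_n = 1$ it gives equal weights and so reproduces the ordinary sample quantile as a sanity check (it does not satisfy the smoothness assumptions of the paper).

```python
def triweight(u):
    """Twice continuously differentiable probability kernel on [-1, 1] (illustrative choice)."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0

def uniform_kernel(u):
    """Uniform kernel on [-1, 1]; used here only as an equal-weights sanity check."""
    return 0.5 if abs(u) <= 1.0 else 0.0

def cond_quantile(xs, ys, x0, u, a_n, kernel=triweight):
    """inf{y : m_n^*(y|x0) >= u}, computed as a weighted sample quantile of the Y_i."""
    n = len(xs)
    def ecdf(x):
        return sum(1 for xi in xs if xi <= x) / n
    f0 = ecdf(x0)
    w = [kernel((f0 - ecdf(xi)) / a_n) for xi in xs]
    total = sum(w)
    acc = 0.0
    for yi, wi in sorted(zip(ys, w)):
        acc += wi
        if acc / total >= u:
            return yi
    return max(ys)  # guards against rounding in the final comparison

# Hypothetical toy data, not taken from the paper.
xs = [0.3, 1.7, 0.9, 2.4, 1.1, 0.5, 2.0, 1.4]
ys = [1.0, 2.5, 1.2, 3.1, 0.7, 0.9, 2.8, 1.9]

local_quartiles = [cond_quantile(xs, ys, 1.2, u, 0.4) for u in (0.25, 0.5, 0.75)]
plain_median = cond_quantile(xs, ys, 1.2, 0.5, 1.0, kernel=uniform_kernel)
```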




THEOREM 3. Under the assumptions of the corollary, if $m'(y_u|x_0) = 
(d/dy)\,m(y|x_0) > 0$ at $y = y_u$ and $G$ is continuous we have for almost all $x_0$ 

$Q_n(u) \to N(0, \sigma^2)$ in distribution, 

where 

$\sigma^2 = u(1 - u) \int K^2(x)\,dx \,/\, [m'(y_u|x_0)]^2.$ 

PROOF. The method is the same as for showing asymptotic normality of 
(unconditional) quantiles. [See, e.g., Wretman (1978).] Compared with Wretman's 
proof, we use C-tightness of $m_n^*$ rather than Chebyshev's inequality (because of 
the heavy dependence of the summands) to show that when 

$W_n^* = (na_n)^{1/2}[m_n^*(y_u|x_0) - m(y_u|x_0)]$ 

and 

$W_n = (na_n)^{1/2}[m_n^*(y_u + y/(na_n)^{1/2}|x_0) - m(y_u + y/(na_n)^{1/2}|x_0)],$ 

then $W_n^* - W_n \to 0$ in probability. The theorem then immediately follows from 
asymptotic normality of $W_n^*$ and continuity of the standard normal distribution 
function. See Wretman (1978) for details. □ 


4. Lemmas and proofs. Put 

$\beta_n(y) \equiv \beta_n(y|x_0) = (na_n)^{1/2}[m_n(y|x_0) - \bar m_n(y|x_0)].$ 

We shall prove the theorem by showing that: 
(i) the finite-dimensional distributions of $\beta_n$ converge to those of $B_0$. 
(ii) $\{\beta_n: n \ge 1\}$ is uniformly C-tight, i.e., for each $\varepsilon > 0$ and every $\rho > 0$ there 
exist some $\delta > 0$ and $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$ 

$P\left(\sup_{\|y_1 - y_2\| \le \delta} |\beta_n(y_1) - \beta_n(y_2)| \ge \rho\right) \le \varepsilon.$ 

Observe that $\beta_n(0) = 0$ for each $n \in \mathbb{N}$. As to (i) we have the following 

LEMMA 4. Under the assumptions of the theorem, the finite-dimensional 
distributions of $\beta_n$ converge to those of $B_0$. 

PROOF. Follows at once from the theorem in Stute (1984b) upon applying the 
Cramér-Wold device. □ 


To prove tightness we shall have to rest on some bounds for the oscillation 
modulus of multivariate empirical processes. For this, let $a(1), \ldots, a(1 + d)$ 
be some positive constants (which may depend on $n$) and put $a = (a(1), \ldots, 
a(1 + d))$. For $x = (x_1, \ldots, x_{1+d}) \le y = (y_1, \ldots, y_{1+d})$ componentwise denote 
with $I_{x,y} = \prod_{i=1}^{1+d} (x_i, y_i]$ the pertinent rectangle in $\mathbb{R}^{1+d}$. In Stute (1984a) we 
derived finite sample upper bounds and almost sure limit results for maximal 
deviations of the empirical process 

$\alpha_n(x_1, \ldots, x_{1+d}) = n^{1/2}[H_n(x_1, \ldots, x_{1+d}) - H(x_1, \ldots, x_{1+d})]$ 

over small rectangles. To be specific let 

$\omega_n(a) = \sup\{|\alpha_n(I_{x,y})|: |y_i - x_i| \le a(i) \text{ for } 1 \le i \le 1 + d\}$ 

denote the oscillation modulus of $\alpha_n$. Then it was shown in Theorem 1.7 of that 
paper that under certain growth assumptions on $a(1), \ldots, a(1 + d)$ 

(1) $P(\omega_n(a) > s) \le C_1 \left[\min_{1 \le i \le 1+d} a(i)\right]^{-(1+d)} \exp\left[-C_2 s^2 / \min_{1 \le i \le 1+d} a(i)\right]$ 

for some $C_1$, $C_2$ not depending on $a$, $s$, $n$, or $H$. 
To study conditional empirical processes at a point $x_0 \in \mathbb{R}$, we shall have to 
restrict ourselves to rectangles $I_{x,y}$ for which $x_1 \le x_0 \le y_1$. Write 

$\omega_n(a; x_0) := \sup\{|\alpha_n(I_{x,y})|: |y_i - x_i| \le a(i) \text{ for } 1 \le i \le 1 + d \text{ and } x_1 \le x_0 \le y_1\}$ 

and put (with $\mu$ denoting the distribution pertaining to $H$) 

$\gamma(a; x_0) := \sup\{\mu(I_{x,y})\}$ 

with the supremum extended over the class of rectangles appearing in $\omega_n(a; x_0)$. 
To motivate Lemma 5 below, we should like to mention that (1) had been derived 
by bounding $\omega_n(a)$ from above by the maximal deviation of $\alpha_n$ over a finite 
number of small rectangles $I_1, \ldots, I_m$ forming a partition of $[0,1]^{1+d}$, with 
the length of each side being of the order $\min_{1 \le i \le 1+d} a(i)$. Hence $m \sim 
[\min_{1 \le i \le 1+d} a(i)]^{-(1+d)}$. After that, an appropriate maximal inequality together 
with a standard Bernstein exponential bound applied to $\alpha_n(I_j)$, $1 \le j \le m$, then 
yielded the desired bound (1). In the case of $\omega_n(a; x_0)$, since $x_1 \le x_0 \le y_1$ for all 
rectangles in question, it suffices to partition the coordinate space $\prod_{i=2}^{1+d} [0,1]$ into 
small rectangles with each side having length of order $\min_{2 \le i \le 1+d} a(i)$. From 
this observation one may expect a bound for $\omega_n(a; x_0)$ similar to (1), but with 
$[\min_{1 \le i \le 1+d} a(i)]^{-(1+d)}$ replaced by the smaller factor $[\min_{2 \le i \le 1+d} a(i)]^{-d}$. As 
remarked after Theorem 1.7 in Stute (1984a), the bound (1) may be improved if 
some further information on $H$ is available, e.g., if $H$ has a bounded Lebesgue 
density. In fact, the denominator $\min_{1 \le i \le 1+d} a(i)$ in the exponential factor 
occurs when applying the Bernstein bound by observing that for each $I_{x,y}$ with 
$|y_i - x_i| \le a(i)$ we have $\mu(I_{x,y}) \le \mathrm{const} \times \min_{1 \le i \le 1+d} a(i)$. Noting that, by definition, 
$\gamma(a; x_0)$ is a general upper bound for $\mu(I_{x,y})$ we thus obtain: 


LEMMA 5. Suppose that $H$ has uniform marginals. Then there exist 
constants $C_1, C_2 > 0$ (not depending on $s$, $n$, $a$, or $H$) such that 

(2) $P(\omega_n(a; x_0) > s) \le C_1 \left[\min_{2 \le i \le 1+d} a(i)\right]^{-d} \exp\left[-C_2 s^2 / \gamma(a; x_0)\right],$ 

provided that $2 \le s\sqrt{n}$ and $C_3 \gamma(a; x_0) \ge s/\sqrt{n}$, $C_3$ finite. 




We shall apply Lemma 5 to vectors $a = (a(1), \ldots, a(1 + d))$, where $a(1) = 
a_n \to 0$ at appropriate rates and $\min_{2 \le i \le 1+d} a(i) = \delta > 0$ is small but fixed. 


LEMMA 6. $\{\beta_n: n \ge 1\}$ is uniformly C-tight. 

PROOF. Write 

$\beta_n(y|x_0) = \sqrt{na_n}\,[m_n(y|x_0) - m_n^0(y|x_0)] + \sqrt{na_n}\,[m_n^0(y|x_0) - \bar m_n(y|x_0)]$ 

$\qquad = \beta_{n1}(y|x_0) + \beta_{n2}(y|x_0),$ 

where 

$m_n^0(y|x_0) = a_n^{-1} \int 1_{[0, y]}(u)\, K\left(\frac{F_n(x_0) - F_n(x)}{a_n}\right) H(dx, du).$ 

We show that both $\beta_{n1}$ and $\beta_{n2}$, $n \ge 1$, are uniformly C-tight. As to $\beta_{n1}$, we have, 
upon integrating by parts, 

$m_n(y|x_0) - m_n^0(y|x_0) = a_n^{-1}[H_n(1, y) - H(1, y)]\, K\left(\frac{F_n(x_0) - 1}{a_n}\right) - a_n^{-1} \int [H_n(x, y) - H(x, y)]\, K_n(dx)$ 

with 

$K_n(x) = K\left(\frac{F_n(x_0) - F_n(x)}{a_n}\right).$ 

Since $F_n(x) \to x$ ($0 \le x \le 1$) with probability one, $a_n \to 0$ and $K$ has finite 
support, the first summand is zero with probability one for all $n \ge n_0(\omega)$, say, 
not depending on $y$. Similarly, for $n \ge n_0(\omega)$ 

$\int [H_n(x, y) - H(x, y)]\, K_n(dx) = \int [H_n(x, y) - H(x, y) - H_n(x_0, y) + H(x_0, y)]\, K_n(dx).$ 

Assume $K = 0$ outside $(-1, 1)$ w.l.o.g., i.e., the last integral remains 
unchanged when restricting the domain of integration to those $x$'s for which 
$|F_n(x_0) - F_n(x)| \le a_n$. For given $\varepsilon > 0$ the Dvoretzky-Kiefer-Wolfowitz (1956) 
bound entails that for some finite (large) $C_1$ one has, up to an event of probability 
less than or equal to $\varepsilon$, that 

$|F(x_0) - F(x)| \le a_n + C_1 n^{-1/2} \le C a_n$ 

whenever $|F_n(x_0) - F_n(x)| \le a_n$. Denoting with $\|K\|$ the total variation of $K$ we 
thus obtain for all large $n$, neglecting an event of probability $\le \varepsilon$, that 

$\sup_{\|y_1 - y_2\| \le \delta} |\beta_{n1}(y_1|x_0) - \beta_{n1}(y_2|x_0)| \le a_n^{-1/2} \|K\| \sum_{i=1}^{d} \omega_n(Ca_n, 1, \ldots, \delta, \ldots, 1; x_0),$ 

where in the $i$th summand $\delta$ occupies the coordinate belonging to $y_i$. 




To bound the last sum, we may apply Lemma 5 with $s = \rho\sqrt{a_n}$ by observing that 
$\gamma \equiv \gamma(Ca_n, 1, \ldots, \delta, \ldots, 1; x_0) = Ca_n\delta$, so that the growth conditions are satisfied 
for at least all large $n$. For such an $n$ 

$P(\omega_n(Ca_n, 1, \ldots, \delta, \ldots, 1; x_0) > \rho\sqrt{a_n}) \le C_1 \delta^{-d} \exp[-C_2 \rho^2 a_n / \gamma].$ 

By (A), $\gamma = o(a_n / \ln \delta^{-1})$ as $\delta \to 0$ and $n \to \infty$. Hence the last exponential 
bound can be made arbitrarily small for all $\delta \le \delta_0$ and $n \ge n_0$, say. This proves 
tightness of $\beta_{n1}$. 

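As to $\beta_{n2}$, write 

$m_n^0(y|x_0) = a_n^{-1} \int 1_{[0, y]}(u)\, K\left(\frac{F(x_0) - F(x)}{a_n}\right) H(dx, du)$ 

$\qquad + a_n^{-2} \int 1_{[0, y]}(u)\, [F_n(x_0) - F_n(x) - F(x_0) + F(x)]\, K'\left(\frac{F(x_0) - F(x)}{a_n}\right) H(dx, du)$ 

$\qquad + a_n^{-3} \int 1_{[0, y]}(u)\, [F_n(x_0) - F_n(x) - F(x_0) + F(x)]^2\, K''(\Delta)/2 \; H(dx, du)$ 

$\qquad = \bar m_n(y|x_0) + I_1(y, n) + I_2(y, n)$ 

with $\Delta$ between $a_n^{-1}[F_n(x_0) - F_n(x)]$ and $a_n^{-1}[F(x_0) - F(x)]$. Similar to Lemma 
1 of Stute (1984b) we get that $(na_n)^{1/2} I_2(y, n) \to 0$ in probability uniformly in $y$. 
Thus to prove the lemma it remains to show that 

$(\sqrt{na_n}\, I_1(\cdot, n): n \ge 1)$ is uniformly C-tight. 

With $\alpha_n(x) = n^{1/2}[F_n(x) - x]$, $0 \le x \le 1$, we have 

$\sqrt{na_n}\, I_1(y, n) = a_n^{-3/2} \int m(y|x)\, [\alpha_n(x_0) - \alpha_n(x)]\, K'\left(\frac{x_0 - x}{a_n}\right) dx$ 

$\qquad = a_n^{-3/2} \int [m(y|x) - m(y|x_0)]\, [\alpha_n(x_0) - \alpha_n(x)]\, K'\left(\frac{x_0 - x}{a_n}\right) dx$ 

$\qquad + a_n^{-3/2}\, m(y|x_0) \int [\alpha_n(x_0) - \alpha_n(x)]\, K'\left(\frac{x_0 - x}{a_n}\right) dx.$ 

Use the same arguments as in the proof of Lemma 3 in Stute (1984b) to show 
that the first summand converges to zero in probability uniformly in $y$ whenever 
$m(\cdot|x)$ is equicontinuous in a neighborhood of $x_0$. Finally, for large $n$, 

$a_n^{-3/2}\, m(y|x_0) \int [\alpha_n(x_0) - \alpha_n(x)]\, K'\left(\frac{x_0 - x}{a_n}\right) dx = -a_n^{-1/2}\, m(y|x_0) \int K\left(\frac{x_0 - x}{a_n}\right) \alpha_n(dx).$ 

Since $a_n^{-1/2} \int K[(x_0 - x)/a_n]\, \alpha_n(dx)$ has a normal limit distribution and is hence 
stochastically bounded, and since $m(\cdot|x_0)$ is (uniformly) continuous, this proves 
tightness of $\beta_{n2}$. □ 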


5. Concluding remark. It is possible to extend the results of this paper to 
multivariate $X$. We found it useful, however, to separate the univariate from the 
general case. In fact, regarding the distribution of $X$, our processes $m_n$ (resp. $m_n^*$) 
turned out to be distribution-free. For multivariate $X$, the transformations 
involved lead to processes with underlying uniform marginals, but otherwise 
depending on the (joint) distribution of $X$. 


REFERENCES 


BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York. 

COLLOMB, G. (1981). Estimation non-paramétrique de la régression: Revue bibliographique. Internat. 
Statist. Rev. 49 75-93. 

DVORETZKY, A., KIEFER, J. and WOLFOWITZ, J. (1956). Asymptotic minimax character of the sample 
distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27 
642-669. 

NEUHAUS, G. (1971). On weak convergence of stochastic processes with multidimensional time 
parameter. Ann. Math. Statist. 42 1285-1295. 

STUTE, W. (1984a). The oscillation behavior of empirical processes: The multivariate case. Ann. 
Probab. 12 361-379. 

STUTE, W. (1984b). Asymptotic normality of nearest neighbor regression function estimates. Ann. 
Statist. 12 917-926. 

WRETMAN, J. (1978). A simple derivation of the asymptotic distribution of a sample quantile. Scand. 
J. Statist. 5 123-124. 


MATHEMATICAL INSTITUTE 
UNIVERSITY OF GIESSEN 




The Annals of Statistics 
1986, Vol. 14, No. 2, 648-664 


LARGE DEVIATIONS OF ESTIMATORS$^1$ 


By A. D. M. KESTER AND W. C. M. KALLENBERG 
University of Limburg, Maastricht and Twente University of Technology 


The performance of a sequence of estimators $\{T_n\}$ of $g(\theta)$ can be 
measured by its inaccuracy rate $-\liminf_{n \to \infty} n^{-1} \log P_\theta(\|T_n - g(\theta)\| > \varepsilon)$. 
For fixed $\varepsilon > 0$ optimality of consistent estimators wrt the inaccuracy rate is 
investigated. It is shown that for exponential families in standard 
representation with a convex parameter space the maximum likelihood estimator is 
optimal. If the parameter space is not convex, which occurs for instance in 
curved exponential families, in general no optimal estimator exists. 

For the location problem the inaccuracy rate of M-estimators is 
established. If the underlying density is sufficiently smooth an optimal 
M-estimator is obtained within the class of translation equivariant estimators. 

Tail-behaviour of location estimators is studied. A connection is made 
between gross error and inaccuracy rate optimality. 


1. Introduction. Let $\mathcal{X}$ be a set of points $x$ and $\mathcal{B}$ a $\sigma$-field of subsets of $\mathcal{X}$. 
The parameter space $\Theta$ is an index set of points $\theta$ and for each $\theta \in \Theta$, $P_\theta$ is a 
probability measure on $\mathcal{B}$. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random 
variables, each defined on $\mathcal{X}$. The distribution of $S = (X_1, X_2, \ldots)$ is denoted by $P_\theta$, 
$\theta \in \Theta$. Let $g$ be a mapping of the abstract space $\Theta$ into $\mathbb{R}^k$ and let $\{T_n\}$ denote a 
sequence of estimators of $g(\theta)$, where $T_n$ is based on $n$ observations. Note that $T_n$ 
takes values in $g(\Theta)$ only. The performance of $\{T_n\}$ is measured by its inaccuracy 
rate 

(1.1) $e(\varepsilon, \theta, \{T_n\}) = -\liminf_{n \to \infty} n^{-1} \log P_\theta(\|T_n - g(\theta)\| > \varepsilon);$ 

the larger the inaccuracy rate the better the estimator. Because the 
large deviation probabilities involved are hard to handle, inaccuracy rates have 
been discussed mainly for $\varepsilon \to 0$. See for example Bahadur, Gupta and Zabell 
(1980) and Fu (1982) and references therein. In this paper, however, the 
inaccuracy rate is investigated for fixed $\varepsilon > 0$. 

Two main themes are considered: 

(i) optimality of consistent estimators; 
(ii) the inaccuracy rate of M-estimators for the location problem. 


To investigate optimality of a sequence of consistent estimators the inaccuracy 
rate of the sequence is compared with an upper bound of (1.1). As usual in large 


Received December 1984; revised June 19865. 

$^1$This research was done while the authors were affiliated with the Free University at Amsterdam. 

AMS 1980 subject classifications. Primary 62F10; secondary 60F10. 

Key words and phrases. Large deviations, inaccuracy rate, exponential convexity, maximum 
likelihood estimator, M-estimator, translation equivariance, tail-behaviour. 




deviation theory the upper bound is obtained essentially by application of the 
Neyman-Pearson lemma; cf. Bahadur et al. (1980). 


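PROPOSITION 1.1 (Bahadur). If $\{T_n\}$ is a consistent estimator of $g(\theta)$ for 
each $\theta \in \Theta$, then 

(1.2) $e(\varepsilon, \theta, \{T_n\}) \le b(\varepsilon, \theta),$ 

where $b(\varepsilon, \theta) = \inf\{K(\eta, \theta): \eta \in \Theta, \|g(\eta) - g(\theta)\| > \varepsilon\}$ and $K$ is the 
Kullback-Leibler information 

(1.3) $K(\eta, \theta) = \int \log\left(\frac{dP_\eta}{dP_\theta}\right) dP_\eta$ when $P_\eta \ll P_\theta$, and $K(\eta, \theta) = \infty$ otherwise. 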
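
A simple case in which both sides of (1.2) are explicit is the normal location family $N(\theta, 1)$ with $g(\theta) = \theta$ and $T_n$ the sample mean; this example is an illustration added here under those assumptions, not one worked in the paper. There $K(\eta, \theta) = (\eta - \theta)^2/2$, so $b(\varepsilon, \theta) = \varepsilon^2/2$, and the exact tail probability of the sample mean is available through the complementary error function. The sketch evaluates $-n^{-1} \log P_\theta(|T_n - \theta| > \varepsilon)$ for growing $n$; all function names are hypothetical.

```python
import math

def tail_prob(n, eps):
    """P(|mean of n i.i.d. N(theta, 1) variables - theta| > eps) = erfc(eps * sqrt(n / 2))."""
    return math.erfc(eps * math.sqrt(n / 2.0))

def finite_n_rate(n, eps):
    """Finite-sample analogue -n^{-1} log P(|T_n - theta| > eps) of the inaccuracy rate."""
    return -math.log(tail_prob(n, eps)) / n

def bahadur_bound(eps):
    """b(eps, theta) = inf{(eta - theta)^2 / 2 : |eta - theta| > eps} = eps^2 / 2."""
    return 0.5 * eps ** 2

eps = 0.5
rates = [finite_n_rate(n, eps) for n in (50, 200, 800, 2000)]
bound = bahadur_bound(eps)
```

The finite-sample values decrease towards `bound`, so in this family the sample mean attains Bahadur's bound in the limit.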


In view of Proposition 1.1 a sequence of estimators $\{T_n\}$ is called optimal wrt 
the inaccuracy rate (IR-optimal) at $\theta_0$ for $\varepsilon > 0$ if $\{T_n\}$ is a consistent estimator 
of $g(\theta)$ for each $\theta \in \Theta$ and 

(1.4) $-\lim_{n \to \infty} n^{-1} \log P_{\theta_0}(\|T_n - g(\theta_0)\| > \varepsilon) = b(\varepsilon, \theta_0).$ 

Note that IR-optimality at $\theta_0$ depends on $\Theta$ in two ways: Bahadur's bound 
$b(\varepsilon, \theta_0)$ depends on $\Theta$ and $\{T_n\}$ has to be a consistent estimator of $g(\theta)$ for each 
$\theta \in \Theta$; cf. Proposition 1.1. The important role of Kullback-Leibler information in 
large deviation theory is apparent from the following simple but useful 
proposition of Bahadur (1980, 1983 Section 2), which states that IR-optimality of 
$\{K(T_n, \theta_0)\}$ as an estimator of $K(\theta, \theta_0)$ yields IR-optimality of $\{g(T_n)\}$ as an 
estimator of $g(\theta)$. 


PROPOSITION 1.2 (Bahadur). If $g$ is continuous and $\{T_n\}$ is a consistent 
estimator of $\theta$ for each $\theta \in \Theta$ such that for each $b < b_0 = b_0(\theta_0)$ 

(1.5) $-\limsup_{n \to \infty} n^{-1} \log P_{\theta_0}(K(T_n, \theta_0) \ge b) = b,$ 

then $\{g(T_n)\}$ is an inaccuracy rate optimal estimator of $g(\theta)$ at $\theta_0$ for each $\varepsilon > 0$ 
with $b(\varepsilon, \theta_0) < b_0$. 


It is well-known that the likelihood ratio test is often an optimal test in a large 
deviation context. One might guess that the maximum likelihood estimator 
(MLE) plays a similar role in large deviation estimation theory. It turns out that 
for exponential families in standard representation with a convex parameter 
space MLE's are indeed IR-optimal. Exponential convexity (cf. Section 2) is here 
the key point. As soon as exponential convexity fails, (1.5) cannot hold true for all 
$\theta_0$ and $b$ which are of interest; cf. Lemma 2.4. This occurs for instance in curved 
exponential families. 

In Section 3 shift families $\{P_\theta\} = \{p(x - \theta): \theta \in \mathbb{R}\}$ are considered. Here $p$ is 
a Lebesgue density. Only in some exceptional cases are shift families 
exponentially convex. We may therefore expect that IR-optimal estimators usually do not 
exist. Sievers (1978) came to the same conclusion, albeit apparently on a more 
empirical basis. 




In shift families it is natural to restrict attention to (translation) equivariant 
estimators. When $p(x - \varepsilon)/p(x + \varepsilon)$ is nondecreasing in $x$, Sievers (1978) 
obtained an upper bound for the inaccuracy rate of equivariant estimators by 
application of the Neyman-Pearson lemma. Remarkably, Sievers' bound can be 
higher than Bahadur's bound. The reason is that Sievers' bound concerns 
equivariant estimators, which are not necessarily consistent; cf. Example 3.1. 
However, when $p$ is symmetric and sufficiently regular, Sievers' bound is not 
larger than Bahadur's; cf. Kester (1985), page 71, and Fu (1985). 

For a wider class of shift families we derive in Section 3 an upper bound, which 
coincides with Sievers' bound when $p$ satisfies his condition. When $p$ is 
sufficiently smooth and $\varepsilon$ is small enough (but fixed), an M-estimator is constructed 
which attains this bound. This result (Theorem 3.4) provides optimal equivariant 
estimators even for such densities as the Cauchy density; cf. Example 3.2. In 
contrast to Sievers, who applies a finite sample result, we employ a typical large 
deviation approach. It is also shown that in the double exponential family a 
trimmed mean attains Sievers' bound. 

Apart from investigating optimality of equivariant estimators the inaccuracy 
rate of quite general M-estimators is obtained. 

In the last section tail-behaviour of location estimators is discussed, mainly for 
a fixed sample size. Jurečková (1979) has shown that the sample mean has a 
certain gross error optimality property when the distribution of the observations 
has exponentially decreasing tails. Here it is shown that the inaccuracy rate 
optimal estimator for a fixed error, say $\varepsilon$, converges to the sample mean when 
$\varepsilon \to \infty$, i.e., when gross errors come in. 


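2. Exponential families, exponential convexity. Let $\{P_\theta: \theta \in \Theta\}$ be a 
$k$-parameter exponential family in standard representation given by its densities 
wrt a $\sigma$-finite measure $\mu$ on $\mathbb{R}^k$ 

(2.1) $dP_\theta(x) = \exp\{\theta'x - \psi(\theta)\}\, d\mu(x), \quad x \in \mathbb{R}^k, \ \theta \in \Theta \subset \Theta^* \subset \mathbb{R}^k,$ 

where $\Theta$ is a subset of $\Theta^* = \{\theta \in \mathbb{R}^k: \int \exp(\theta'x)\, d\mu(x) < \infty\}$ and $\psi(\theta) = 
\log \int \exp(\theta'x)\, d\mu(x)$, $\theta \in \Theta^*$. Here $\theta'x$ denotes the inner product of $\theta$ and $x$. 
Assume without loss of generality that the covariance matrix of $P_\theta$ is nonsingular 
for $\theta \in \mathrm{int}\, \Theta^*$. Define $\Theta_1 = \{\theta \in \Theta^*: E_\theta\|X\| < \infty\}$, then $\mathrm{int}\, \Theta^* \subset \Theta_1 \subset \Theta^*$ 
and the mapping $\lambda: \theta \to E_\theta X$ is 1-1 on $\Theta_1$; cf. Berk (1972). The likelihood of a 
sample $X_1, \ldots, X_n$ is maximized over $\Theta^*$ at the point 

(2.2) $\hat\theta_n^* = \lambda^{-1}(\bar X_n)$ 

when $\bar X_n \in \Lambda = \{\lambda(\theta): \theta \in \Theta_1\}$. Noting that the Kullback-Leibler information 
$K(\eta, \theta) = (\eta - \theta)'\lambda(\eta) - \psi(\eta) + \psi(\theta)$ when $\eta \in \Theta_1$ and $\theta \in \Theta^*$, it is seen that 
maximizing the likelihood over a subset $\Theta$ of $\Theta^*$ is equivalent to minimizing 
$K(\lambda^{-1}(\bar X_n), \theta) = K(\hat\theta_n^*, \theta)$ over $\theta \in \Theta$ when $\bar X_n \in \Lambda$; cf. Efron (1978). If it exists 
the unique point $\tilde\theta(\eta)$ minimizing $K(\eta, \theta)$ over $\theta \in \Theta$ is called the 
Kullback-Leibler projection of $\eta$ on $\Theta$. Thus when $\bar X_n \in \Lambda$ and $\tilde\theta$ exists at $\lambda^{-1}(\bar X_n)$, 

(2.3) $\hat\theta_n = \tilde\theta(\hat\theta_n^*)$ 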




is the MLE of $\theta$ on $\Theta$. Before stating a lemma which establishes existence of the 
MLE we introduce the "Kullback-Leibler distance" $K(\theta)$ of the boundary of $\Theta^*$ 
to $\theta$ by 

(2.4) $K(\theta) = \sup\{a: \{\eta: K(\eta, \theta) \le a\} \subset C_a \subset \mathrm{int}\, \Theta^*, \text{ where } C_a \text{ is compact}\}.$ 
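
The closed form $K(\eta, \theta) = (\eta - \theta)'\lambda(\eta) - \psi(\eta) + \psi(\theta)$ used above can be checked numerically in a one-parameter example. The sketch below takes the Poisson family in its natural parameterization, where $\mu$ has mass $1/x!$ at $x = 0, 1, \ldots$, $\psi(\theta) = e^\theta$ and $\lambda(\theta) = e^\theta$, and compares the closed form with a direct (truncated) evaluation of $\int \log(dP_\eta/dP_\theta)\, dP_\eta$. The choice of family, the truncation point and the function names are assumptions made for illustration.

```python
import math

def psi(theta):
    """Log normalizing constant of the Poisson family in natural form: psi(theta) = exp(theta)."""
    return math.exp(theta)

def lam(theta):
    """Mean-value map lambda(theta) = E_theta X = psi'(theta) = exp(theta)."""
    return math.exp(theta)

def kl_closed_form(eta, theta):
    """K(eta, theta) = (eta - theta) * lambda(eta) - psi(eta) + psi(theta)."""
    return (eta - theta) * lam(eta) - psi(eta) + psi(theta)

def kl_direct(eta, theta, x_max=200):
    """Sum over x of p_eta(x) * log(p_eta(x) / p_theta(x)), truncated at x_max."""
    total = 0.0
    for x in range(x_max + 1):
        log_p_eta = eta * x - psi(eta) - math.lgamma(x + 1)
        log_ratio = (eta - theta) * x - psi(eta) + psi(theta)
        total += math.exp(log_p_eta) * log_ratio
    return total

eta, theta = math.log(3.0), math.log(1.5)   # Poisson means 3 and 1.5
closed = kl_closed_form(eta, theta)
direct = kl_direct(eta, theta)
```

In mean parameters this is the familiar Poisson expression $3 \log 2 - 3 + 1.5$.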


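LEMMA 2.1. Let $\Theta$ be a relatively closed convex subset of $\Theta^*$ and let 
$\eta \in \mathrm{int}\, \Theta^*$. If $K(\eta, \theta) < K(\theta)$ for some $\theta \in \Theta$, then the Kullback-Leibler 
projection $\tilde\theta(\eta)$ exists; thus $\hat\theta_n$ exists when 

(2.5) $\bar X_n \in \lambda\left(\bigcup_{\theta \in \Theta} \{\eta: K(\eta, \theta) < K(\theta)\}\right).$ 

Moreover, if $K(\eta, \theta) < K(\theta)$ for some $\theta \in \Theta$, then $K(\tilde\theta(\eta), \theta) \le K(\eta, \theta)$ and 
hence $\tilde\theta(\eta) \in \mathrm{int}\, \Theta^*$. 

PROOF. Let $\eta \in \mathrm{int}\, \Theta^*$ and $\theta \in \Theta$ satisfy $K(\eta, \theta) < K(\theta)$. On the compact 
set $\Theta \cap \{\zeta \in \Theta^*: K(\zeta, \theta) \le K(\eta, \theta)\}$ the infimum of $K(\eta, \cdot)$ is attained at $\tau$, 
say. Consider $\xi \in \Theta$ satisfying $K(\xi, \theta) > K(\eta, \theta)$ and let $\delta > 0$. Define $\xi_\alpha = 
\alpha\xi + (1 - \alpha)\theta$ and let $0 < \bar\alpha < 1$ be such that $K(\eta, \xi_{\bar\alpha}) < K(\eta, \xi) + \delta$. Note that 
$\{\xi_\alpha: 0 \le \alpha \le \bar\alpha\} \subset \mathrm{int}\, \Theta^*$. Now $K(\eta, \cdot)$ attains its infimum on the compact set 
$\{\xi_\alpha: 0 \le \alpha \le \bar\alpha\}$ at $\xi_{\alpha^*}$, say. Since $K(\eta, t\xi_{\alpha^*} + (1 - t)\theta)$, $0 \le t \le 1$, is minimal for 
$t = 1$, its derivative is nonpositive at $t = 1$: $(\xi_{\alpha^*} - \theta)'(\lambda(\xi_{\alpha^*}) - \lambda(\eta)) \le 0$. In 
combination with $K(\xi_{\alpha^*}, \theta) - K(\eta, \theta) = (\xi_{\alpha^*} - \theta)'(\lambda(\xi_{\alpha^*}) - \lambda(\eta)) - K(\eta, \xi_{\alpha^*})$ it 
follows that $\xi_{\alpha^*} \in \Theta \cap \{\zeta \in \Theta^*: K(\zeta, \theta) \le K(\eta, \theta)\}$ and hence $K(\eta, \tau) \le 
K(\eta, \xi_{\alpha^*}) \le K(\eta, \xi_{\bar\alpha}) < K(\eta, \xi) + \delta$. Since $\delta > 0$ was arbitrarily chosen we have 
$K(\eta, \tau) \le K(\eta, \xi)$, implying that the infimum of $K(\eta, \cdot)$ on $\Theta$ is attained at $\tau$. 
Unicity of $\tilde\theta(\eta) = \tau$ follows from the convexity of $\Theta$ and the strict convexity of 
$K(\eta, \cdot)$. □ 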


One of the main results of the paper, optimality of MLE's in convex 
exponential families, is presented in the following theorem. 

THEOREM 2.2. Suppose $\Theta$ is a convex relatively closed subset of $\mathrm{int}\, \Theta^*$, 
where $\Theta^*$ is the full parameter space of an exponential family in standard 
representation. Let $\hat g_n = g(T_n)$ where $T_n = \hat\theta_n$ whenever the MLE $\hat\theta_n$ of $\theta$ exists. 
If $g$ is continuous then $\{\hat g_n\}$ is inaccuracy rate optimal at $\theta$ for each $\varepsilon > 0$ 
satisfying 

(2.6) $b(\varepsilon, \theta) < K(\theta).$ 

Before proving Theorem 2.2 we mention the following special case. 

COROLLARY 2.3. Let $\Theta = \Theta^*$ be open. Suppose that 

(2.7) $\{x \in \mathbb{R}^k: \sup_{\theta \in \Theta^*} \{\theta'x - \psi(\theta)\} < \infty\}$ is open, 

then $\{\hat\theta_n^*\}$ is inaccuracy rate optimal for all $\varepsilon > 0$ and $\theta \in \Theta^*$. 




PROOF. Apply Theorem 2.2 with $\Theta = \Theta^*$, $g(\theta) = \theta$ and note that (2.7) 
implies $K(\theta) = \infty$ for each $\theta \in \Theta^*$; cf. Kourouklis (1984). □ 

PROOF OF THEOREM 2.2. Since $\Theta$ is locally compact and $\Theta \subset \mathrm{int}\, \Theta^*$, it 
follows by Theorem 3.1 and Corollary 3.3 of Berk (1972) that $\{\hat\theta_n\}$ and hence $\{T_n\}$ 
is a consistent estimator of $\theta$ for each $\theta \in \Theta$. Let $\theta \in \Theta$ and $\varepsilon > 0$ satisfy (2.6). 
For each $b < K(\theta)$ we have 

(2.8) $P_\theta(K(T_n, \theta) \ge b) \le P_\theta(\lambda^{-1}(\bar X_n) \in A, K(T_n, \theta) \ge b) + P_\theta(\bar X_n \notin \lambda(A))$ 

with $A = \{\eta \in \Theta^*: K(\eta, \theta) < b\}$. By Lemma 2.1 $\lambda^{-1}(\bar X_n) \in A$ implies that 
$\hat\theta_n$ exists and $K(T_n, \theta) = K(\hat\theta_n, \theta) \le K(\hat\theta_n^*, \theta) = K(\lambda^{-1}(\bar X_n), \theta) < b$. It follows 
that the first term in the right-hand side of (2.8) equals zero. The second term 
equals $\exp\{-nb + o(n)\}$ as $n \to \infty$ by Theorem 6 of Efron and Truax (1968). 
Application of Proposition 1.2 completes the proof. □ 


REMARK 2.1. At first sight one might guess that IR-optimality of $\{\hat\theta_n\}$ can be 
proved as follows: Define $g(\theta) = \tilde\theta(\theta)$, show that $g$ is continuous, prove directly 
IR-optimality of $\{\hat\theta_n^*\}$ in the full parameter space $\Theta^*$ and apply Proposition 1.2. 
As a result one obtains 

$-\lim n^{-1} \log P_\theta(\|\hat\theta_n - \theta\| > \varepsilon) = \inf\{K(\eta, \theta): \eta \in \Theta^*, \|\tilde\theta(\eta) - \theta\| > \varepsilon\},$ 

while one has to show 

$-\lim n^{-1} \log P_\theta(\|\hat\theta_n - \theta\| > \varepsilon) = \inf\{K(\eta, \theta): \eta \in \Theta, \|\eta - \theta\| > \varepsilon\}.$ 

(Bahadur's bound depends on the parameter space; cf. Section 1.) 

In general the second infimum is larger than the first one. Under the condition 
that $\Theta$ is a relatively closed convex subset of $\Theta^*$ it is seen by Lemma 2.1 that 
both infima are equal. This argument is also used in the preceding proof applying 
the crucial inequality $K(\hat\theta_n, \theta) \le K(\hat\theta_n^*, \theta)$. Convexity is the key point as is 
further elaborated in a wider context in the rest of this section. 

In general one cannot expect IR-optimality of the MLE when $\Theta$ is not convex. 
To explain this we need the concept of exponential convexity. 

We return to the general framework of Section 1. 

Let {P,: 0 € ©} be the class of all probability measures on (Z, #). For 
7, 9 € © denote by dP, and dP, the densities of P, and P, wrt a dominating 
measure u. The family {P,,,): y(a) € 9, a € [0,1]} between P, and P, is defined 
by its u-densities 


dP, 
(2.9) dP rala) = exp| alog (2) ~ p” *( x) dP(x)lap,>o)(*); 
n 
where y™°(a) is a normalizing constant. Further for n, 0 € © we denote by 


T(n, @) the set {y(a): a € [0,1]}, where y(a) is determined by (2.9). A family 
{P;:0 € ©} is called exponentially convex when ņ, 0 € © implies ['(y, 8) C ©. 




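When $P_\eta$, $P_\theta$ are members of an exponential family $\{P_\theta : \theta \in \Theta\}$, the family between $P_\eta$ and $P_\theta$ is a linear subfamily of $\{P_\theta : \theta \in \Theta^*\}$: $\Gamma(\eta, \theta) = \{\alpha\theta + (1 - \alpha)\eta : \alpha \in [0,1]\}$. Note that here $\Gamma(\eta, \theta) \subset \Theta$ for all $\eta, \theta \in \Theta$ iff $\Theta$ is convex. So if the parameter space of an exponential family is convex, the family is exponentially convex.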

Furthermore we define for 7, @ € © the number C(n, 0) by 


(2.10) C(n, 6) = inf{max[ K(f, n), K(§,0)]: ¢ € 8}. 


This number is strongly related to the Chernoff index; cf. (2.15) and Chernoff 
(1952). When restricted to a subfamily © of © we define 


(2.11) Co(n, 0) = inf{max[ K(f, 7),K(s, 4]: § € @}. 


The following lemma indicates that (1.5) cannot hold for each 6, € © when 
(P,: 0 € O} is not exponentially convex. 


LEMMA 2.4. (i) If n,@ € © satisfy 
(2.12) C(n, 8) < Co(n, 0) 


then for each estimator {T,} of @ and each b with C(n,@) < b < Co(n, 0) 
condition (1.5) fails at least at one of the points y and 8. 

(ii) Jf{P,: @ € O} is exponentially convex then C(n, 6) = Co(n, 9) for all 
7,0€E 0. 

(il) Jf (Py: @€ ©} is closed in total variation and {P,;: 0 E O} is not 
exponentially convex, then there are 4, @ € © such that (2.12) holds. 


Proor. (i) Let C(n, 8) < b < Co(n, 9) and choose č € © such that 
max{K({, 7), K(f, 8)} < b. Since T, takes values in © only we have 
P.(max{K(T,, n), K(T,, 9)} > b) = 1 for each n, implying 

limsup P,(K(T,,7)>6)>0 or lmesup P,(K(T,, 6) > 6) > 0. 
wa x nmr OO 
A slight modification of Theorem 2.1 in Bahadur et al. (1980) yields the result. 

(ii) Without loss of generality assume C(7,#) < œ or, equivalently, 
(dP dP, > 0) > 0. Noting that a > K(y(a@),) and a > K(y(a), 0) are con- 
tinuous and monotone, an & exists which minimizes max[ K(y(qa), n), K(y(a), @)] 
over a € [0,1]. The probability measure P a is unique. We shall prove that for 
each ¢ € O with K(f, 7) and K(f, 0) finite 


(2.13) max[K({,7), K(f, 0@)] = K(&, y(&)) + max[K(7(&), n), K(7(&), 6)]. 
Let {EO with K($, n) and K(f, 6) finite, then P, « P a. Without loss of 




generality assume K(f, 4) > K(f, 8). In view of (2.9) we have 
K(§,n) — K(, y(@)) — K(y(@), n) 
(2.14) = &[ K($,n) — K(S,8) — {K(y(4), n) — K(y(&), 8)}] 
> ã{K(y(ã), 0) — K(y(@),n)}. 
First assume & = 0, then K(y(0), n) = K(y(0), 0) and (2.14) implies (2.13). In case 




0<a@<1 we have K(y(@), n) = K(y(&), 0) and (2.14) implies (2.13). If @ = 1, 
then K(y(1), @) = K(y(1), n) and again (2.14) implies (2.13). Having established 
(2.13) it follows that the infimum in (2.10) is attained at the unique probability 
measure P iay 

Since exponential convexity implies y(&) € O, the proof of (ii) is complete. 

(iii) Let {P,: 0 € @} be not exponentially convex. Then there exist n*, 0* € @ 
such that IF(n*,0*)— © is nonempty. The set {a: y*(a) € ©}, where y* is 
associated with I(n*,0*), is closed because K(-,-) is continuous on I'(y*, 6*) 
and K({,, y*(a)) > 0 implies convergence in total variation of P, to Praca cf. 
Pinsker (1964) page 20. Now let a,, a. be the endpoints of the (a) largest open 
(relative to [0,1]) interval U with {y*(a): a E€ U} A © = Ø. Define 


P*a When a, > 0, 


P, = Po when a, = 0 and Pao, € O, 
P., when a, = 0 and P «o € O, 


y 


and P, similarly for a, Ppap and Py. Note that y(ã)¢ ©, where y is 
associated with T(n, 8), and that C(n, 6) < œ since I(q*,0*) = Ø. Noting that 
the infimum in (2.10) is attained at the unique probability measure Pap 
combination of (2.11) and (2.13) yields 


Coln, 9) = inf (K(f, y(a)): § © O} + Cn, 0) > Cin, 0), 


because K({,, y(&)) > 0 with ¢, € © implies that P, converges in total varia- 
tion to Pa) and hence that y(@) € ©, which is a contradiction. This completes 
the proof of the lemma. O 


REMARK 2.2. When {P;: 0 € ©} is not exponentially convex, closed in total 
variation, and connected, usually there exist 7,@ € © with C(n, @) arbitrarily 
small, thereby refuting (1.5) for arbitrarily small values of b. 


REMARK 2.3. Consider a curved exponential family with statistical curvature unequal to zero. Although it may be possible to obtain an IR-optimal estimator for each $\varepsilon > 0$ at a fixed $\theta$, or for a fixed $\varepsilon > 0$ at each $\theta \in \Theta$ (cf. Examples 3.6 and 3.7 in Kester, 1985, pages 41, 42), IR-optimal estimators for a class of $\varepsilon$'s and $\theta$'s usually do not exist, because a curved exponential family is not exponentially convex unless the curve is a straight line in the natural parameter space.


REMARK 2.4. Let 7, 8 = © besuch that C(n, @) < 00, i.e., pl dP, dP, > 0) > 0. 
The function Ņ™? as defined in (2.9) is convex and continuous on [0,1]. Moreover 
we have for all 0 < a < 1 


paene = eae ena) 
da’ (a) 3 y(a) og dP, z yaj 7 yia > . 




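Inspection of the proof of Lemma 2.4 (ii) now yields

(2.15)    $C(\eta, \theta) = -\psi^{\eta,\theta}(\tilde\alpha) = -\inf_{0 \le \alpha \le 1} \psi^{\eta,\theta}(\alpha) = -\log \inf_{0 \le \alpha \le 1} \int dP_\theta^{\alpha}\, dP_\eta^{1-\alpha}\, d\mu.$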


EXAMPLE 2.1. (i) Let $\{P_\theta : \theta \in \Theta\}$ be the class of all probability measures having a positive and continuous Lebesgue density on $\mathbb{R}$ and let $g$ map $\theta$ onto the median of $P_\theta$. Note that $\{P_\theta : \theta \in \Theta\}$ is exponentially convex. Bahadur et al. (1980) proved that the sample median is IR-optimal for all $\theta \in \Theta$ and $\varepsilon > 0$.

(ii) Let $\{P_\theta : \theta \in \Theta\}$ and $g$ be as in (i), but with the restriction that $P_\theta$ is symmetric about $g(\theta)$. This class is not exponentially convex. Indeed, Example 2.2 in Kester (1985, pages 27-30) shows that IR-optimal estimators do not exist in this example, as already presumed in Bahadur et al. (1980); this parallels the known fact (cf. Pfanzagl (1976)) that the sample median is not optimal with respect to the asymptotic variance in this class.


3. Shift families. In this section let $\{P_\theta : \theta \in \mathbb{R}\}$ be a shift family of probability measures on $\mathbb{R}$ with Lebesgue densities

(3.1)    $p_\theta(x) = p(x - \theta), \quad x, \theta \in \mathbb{R},$

and let $g(\theta) = \theta$.

Shift families are exponentially convex only in exceptional cases. We may therefore expect that Bahadur's bound is usually not attained. On the other hand, translation equivariance is a natural restriction for location estimators in shift families. In the following lemma an upper bound for the inaccuracy rate of equivariant estimators is derived. This result generalizes previous work of Sievers (1978). Note that for equivariant estimators the inaccuracy rate is independent of $\theta$; it is denoted by $e(\varepsilon, \{T_n\})$.


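LEMMA 3.1. If $T_n$ is equivariant then $e(\varepsilon, \{T_n\}) \le C(-\varepsilon, \varepsilon)$; cf. (2.10).

REMARK 3.1. By (2.15) we have

(3.2)    $C(-\varepsilon, \varepsilon) = -\log \inf_{0 \le \alpha \le 1} \int p^{\alpha}(x - \varepsilon)\, p^{1-\alpha}(x + \varepsilon)\, dx,$

which is the expression for the bound in Sievers (1978). The bound $C(-\varepsilon, \varepsilon)$ will be called Sievers' bound.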
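As a numerical illustration of (3.2), the following sketch approximates Sievers' bound by a trapezoidal rule in $x$ and a grid search over $\alpha$. The double exponential density is used only as a test case, because Example 3.3 states the closed form $\varepsilon - \log(1 + \varepsilon)$ for it. The function names, the integration range and the grid sizes are illustrative assumptions, and the grid is restricted to the open interval $(0, 1)$ to avoid endpoint conventions.

```python
import math

def laplace_density(x):
    """Double exponential density p(x) = exp(-|x|) / 2 (test case, cf. Example 3.3)."""
    return 0.5 * math.exp(-abs(x))

def chernoff_integral(p, eps, alpha, lo=-40.0, hi=40.0, steps=8000):
    """Trapezoidal approximation of the integral of p(x-eps)^alpha * p(x+eps)^(1-alpha)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        value = p(x - eps) ** alpha * p(x + eps) ** (1.0 - alpha)
        total += value if 0 < i < steps else 0.5 * value
    return total * h

def sievers_bound(p, eps, grid=40):
    """Approximate C(-eps, eps) as in (3.2), with alpha on a grid inside (0, 1)."""
    alphas = [k / grid for k in range(1, grid)]
    return -math.log(min(chernoff_integral(p, eps, a) for a in alphas))

eps = 1.0
numeric = sievers_bound(laplace_density, eps)
closed_form = eps - math.log(1.0 + eps)   # value quoted in Example 3.3
print(numeric, closed_form)
```

Any other density with a known bound can be passed as `p`; for heavy-tailed densities the truncation range would have to be widened.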


PROOF OF LEMMA 3.1. Let $\zeta \in \Theta$ satisfy $K(\zeta, -\varepsilon) < \infty$ and $K(\zeta, \varepsilon) < \infty$; then $P_\zeta \ll P_\varepsilon$, hence $P_\zeta$ has a Lebesgue density. The equivariance of $\{T_n\}$ now implies $P_\zeta(T_n = 0) = 0$, since the Lebesgue measure of the same event is zero.

Let $\{n_i\}$ be a subsequence such that $\lim_{i \to \infty} n_i^{-1} \log P_0(|T_{n_i}| > \varepsilon) = -e(\varepsilon, \{T_n\})$. If $\limsup_i P_\zeta(T_{n_i} > 0) > 0$, there exists a subsequence $\{m_i\}$ of $\{n_i\}$ such that $\lim_i P_\zeta(T_{m_i} > 0) > 0$ and hence, by equivariance and Theorem 2.1 in Bahadur et al. (1980),

(3.3)    $-e(\varepsilon, \{T_n\}) = \lim m_i^{-1} \log P_0(|T_{m_i}| > \varepsilon) \ge \liminf m_i^{-1} \log P_0(T_{m_i} > \varepsilon)$
         $= \liminf m_i^{-1} \log P_{-\varepsilon}(T_{m_i} > 0) \ge -K(\zeta, -\varepsilon)$
         $\ge -\max\{K(\zeta, -\varepsilon), K(\zeta, \varepsilon)\}.$

If $\limsup_i P_\zeta(T_{n_i} > 0) = 0$, then $\lim_i P_\zeta(T_{n_i} < 0) = 1 > 0$ and hence

(3.4)    $-e(\varepsilon, \{T_n\}) = \lim n_i^{-1} \log P_0(|T_{n_i}| > \varepsilon) \ge \liminf n_i^{-1} \log P_0(T_{n_i} < -\varepsilon)$
         $= \liminf n_i^{-1} \log P_{\varepsilon}(T_{n_i} < 0) \ge -K(\zeta, \varepsilon)$
         $\ge -\max\{K(\zeta, -\varepsilon), K(\zeta, \varepsilon)\}.$

Since $\zeta \in \Theta$ is arbitrarily chosen, combination of (3.3) and (3.4) yields the result. □


In contrast to Sievers' claim (1978, page 611), Bahadur's bound can be less than Sievers' bound, as is shown by the following example.


EXAMPLE 3.1. Let $p(x) = e^{-x} 1_{(0,\infty)}(x)$; then $\{P_\theta : \theta \in \mathbb{R}\}$ defined by (3.1) is the exponential shift family. Bahadur's bound is $b(\varepsilon) = \varepsilon$ and this bound is attained by $\min\{X_i : 1 \le i \le n\}$; Sievers' bound is $C(-\varepsilon, \varepsilon) = 2\varepsilon$ and this bound is attained by $\min\{X_i : 1 \le i \le n\} - \varepsilon$, an estimator which is not consistent. Examples to the same effect have been given by Kester (1985, page 62) and Fu (1985).
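The two rates in Example 3.1 can be checked from exact tail probabilities, since under $P_0$ the minimum of $n$ standard exponential observations exceeds $c \ge 0$ with probability $e^{-nc}$ and is never negative. The sketch below is only a sanity check under these assumptions; the function names are illustrative and not taken from the paper.

```python
import math

def survival_of_minimum(n, c):
    """P_0(min X_i > c) for n standard exponential observations and c >= 0."""
    return math.exp(-n * c)

def rate_of_minimum(n, eps):
    """-(1/n) log P_0(|min X_i| > eps); the minimum cannot fall below 0."""
    return -math.log(survival_of_minimum(n, eps)) / n

def rate_of_shifted_minimum(n, eps):
    """-(1/n) log P_0(|min X_i - eps| > eps); only the event min X_i > 2*eps contributes."""
    return -math.log(survival_of_minimum(n, 2.0 * eps)) / n

n, eps = 50, 0.3
print(rate_of_minimum(n, eps), rate_of_shifted_minimum(n, eps))
```

The second estimator has the larger rate for this particular $\varepsilon$, but it converges to $-\varepsilon$ rather than to the true location, which is the inconsistency noted in the example.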


Our next aim is to derive the inaccuracy rate for a wide class of M-estimators and to investigate at which of these estimators Sievers' bound is attained. An M-estimator is defined here as a suitable zero or change of sign of

    $\lambda_n(t) = \sum_{i=1}^{n} \psi(X_i - t),$

where $\psi$ is a function into the extended real line which attains positive as well as negative values, but not both $-\infty$ and $+\infty$. We consider two classes of functions $\psi$, requiring either

(3.5)    $\psi$ is nondecreasing

or

(3.6)    $\psi$ is bounded, continuous, and such that $\lambda_n$ has at least one zero for each $n$ $[P_0]$.

The condition on $\lambda_n$ holds when $x\psi(x)$ is nonnegative for $|x|$ large enough. When $\psi$ satisfies (3.5) the M-estimator $\{T_n\} = \{T_n^{(\psi)}\}$ is defined by

(3.7)    $T_n = \sup\{t : \lambda_n(t) \ge 0\}.$




When $\psi$ satisfies (3.6), $\{T_n\}$ is defined by

(3.8)    $T_n = t^+$ when $t^+ - M_n \le M_n - t^-$, and $T_n = t^-$ when $t^+ - M_n > M_n - t^-$,

where
    $t^+ = \inf\{t : t \ge M_n,\ \lambda_n(t) = 0\},$
    $t^- = \sup\{t : t \le M_n,\ \lambda_n(t) = 0\},$
and where $M_n = X_{[n/2]:n}$ is the sample median. Note that definitions (3.7) and (3.8) render $\{T_n\}$ translation equivariant.

The inaccuracy rates of these estimators involve the log-moment generating functions of $\psi(X)$ under $P_\varepsilon$ and $P_{-\varepsilon}$; we define

    $\rho_\theta(\tau) = \log \int e^{\tau\psi(x)}\, dP_\theta(x)$

and the quantity $e_\psi(\varepsilon)$ by

    $e_\psi(\varepsilon) = \min\{-\inf_{\tau \ge 0} \rho_{-\varepsilon}(\tau),\ -\inf_{\tau \le 0} \rho_{\varepsilon}(\tau)\}.$

In the following two theorems the inaccuracy rate of M-estimators is determined.


THEOREM 3.2. Let $\psi$ satisfy (3.5) and let $\{T_n\}$ be defined by (3.7). If $P(\psi(X_1) < 0) > 0$ or $P(\psi(X_1) = 0) = 0$ then

(3.9)    $e(\varepsilon, \{T_n\}) = e_\psi(\varepsilon).$


Since this result is very similar to Theorem 2 in Rubin and Rukhin (1983), the proof of Theorem 3.2, which is essentially an application of Chernoff's theorem, is omitted.
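To make the quantity $e_\psi(\varepsilon)$ concrete, the sketch below evaluates it numerically for $\psi(x) = x$, for which the M-estimator is the sample mean, under a standard normal shift family. The normal family is an assumption introduced here for illustration only; for it $\rho_\theta(\tau) = \tau\theta + \tau^2/2$, so the expected value is $\varepsilon^2/2$. The helper names and the grid sizes are likewise illustrative.

```python
import math

def normal_density(x):
    """Standard normal density (illustrative choice of p)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def rho(psi, p, theta, tau, lo=-15.0, hi=15.0, steps=2000):
    """rho_theta(tau) = log of the integral of exp(tau * psi(x)) p(x - theta) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        value = math.exp(tau * psi(x)) * p(x - theta)
        total += value if 0 < i < steps else 0.5 * value
    return math.log(total * h)

def e_psi(psi, p, eps, tau_max=3.0, grid=150):
    """min{-inf_{tau >= 0} rho_{-eps}(tau), -inf_{tau <= 0} rho_{eps}(tau)} on a tau grid."""
    taus = [tau_max * k / grid for k in range(grid + 1)]
    right = -min(rho(psi, p, -eps, t) for t in taus)
    left = -min(rho(psi, p, eps, -t) for t in taus)
    return min(right, left)

rate = e_psi(lambda x: x, normal_density, 1.0)
print(rate)
```

The search range `tau_max` must contain the minimizing $\tau$; for other choices of $\psi$ or $p$ it may need to be enlarged.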


THEOREM 3.3. Assume that $p$ is positive in a neighbourhood of 0 and that $P_0((-\infty, 0)) = \tfrac12$. If $\psi$ satisfies (3.6) and is moreover continuously differentiable with bounded derivative such that $|\psi'(x) - \psi'(y)| \le c|x - y|$ for some $c < \infty$ and all $x, y \in \mathbb{R}$, and such that $\int \psi'(x) p(x)\, dx > 0$, then for some $\varepsilon_0 > 0$ and each $0 < \varepsilon < \varepsilon_0$

    $e(\varepsilon, \{T_n\}) = e_\psi(\varepsilon),$

where $\{T_n\}$ is defined by (3.8).


REMARK 3.2. Rubin and Rukhin (1983) remark that for the MLE of the Cauchy shift family (3.9) does not hold. Noting that this MLE is obtained as the M-estimator with $\psi(x) = 2x(1 + x^2)^{-1}$, Theorem 3.3 implies however that (3.9) does hold in this case when $\varepsilon$ is sufficiently small. Together with Theorem 3.4 (cf. Example 3.2) this also provides an answer to the open problem mentioned in Fu (1985, Remark 2).


PROOF OF THEOREM 3.3. Let c, = {pdx > 0. Since the Lipschitz condition 
on y is “inherited” by n~'X,, we have writing ô = $c™'c, 


(3.10) n-'N,(0) < -te = n7'N,(t) <0 on(—5, 68). 


Let O <e < $6. If |T | >£, A,(—e) > 0, A (£) < 0, and X (t) < 0 on (—ô, ô) 
then |M,| > +8; hence we obtain 
P,(|Z,,| > €) < Po(A,(—e) < OorA,(e) = 0) + PIM] = 16) 
ey) +P (nN (0) > —4e,). 
By Chernoffs theorem and translation equivariance we have 
(3.12) — lim n~"logP)(A,(—e) < OorA,(e) = 0) = e,(e). 


Writing p, = P,((46, 00)), p_ = P,(— œ, — +8)) it is readily seen (cf. Example 
6.1 in Bahadur (1971)) that 


— lim n~ log P (IM, > 18) = min{ —} log[4p, (1 —p.)|, 
3.13 ae 


—4logl4p_(1 - p_)]} > 0. 
The derivative of 


T> log | e" =°/dp dx 
at + = 0 equals tc, > 0, and hence 


— inf log fe” ~°/4n dx = c, > 0. 


TsO 
By Chernoff’s theorem it follows that 
(3.14) — lim n“logP,(n-'¥,(0) = —4e,) = c> 0. 


Since w is bounded and continuous, p(T) is continuous in @ and r by dominated 
convergence. Moreover, by strict convexity of pọ we have p,(7) > 0 for each 7 > 0 
or p,(t) > 0 for each r <0. Without loss of generality assume the latter; 
by pointwise convergence of p, to pọ and convexity of p, it follows that 
inf, <9P,T) > 0 as e > 0, implying 


(3.15) ee) > 0 ase 0. 


Combination of (3.11), (3.12), (3.13), (3.14), and (8.15) yields that there exists 
e, > 0 such that for each 0 < e < e 


— limsup n` ‘log P,(|T,| > £) > e,(e). 


On the other hand 
P,(\T,| > £) = Po(|T,| = £) = Po(A,(-e) < OorA,(e) > 0) 
—Py(n-'X,(0) = — $e) 
and hence there exists £} < & such that for each 0 < e€ < ge, 


— liminfn~ "log P,(|T,| > £) < e,(e).0 




REMARK 3.3. The only property of the sample median $M_n$ we need in the above proof is (3.13). Therefore, if we define $\{T_n\}$ by (3.8) using another preliminary estimator $\{M_n\}$ which satisfies

    $-\lim n^{-1} \log P_0(|M_n| > \delta) > 0$

for each $\delta > 0$, Theorem 3.3 holds for $\{T_n\}$.
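The exponential tail behaviour of the sample median that the proof relies on can be illustrated with an exact binomial computation. For odd $n$ the median exceeds a point $c$ exactly when at least $(n+1)/2$ observations exceed $c$, and with $p_+ = P(X_1 > c) < \tfrac12$ the limiting rate is $-\tfrac12 \log[4p_+(1 - p_+)]$, the form of the terms appearing in (3.13). The sketch below is a one-sided check under these assumptions; the function names and the choice $p_+ = 0.3$ are illustrative.

```python
import math

def log_binomial_upper_tail(n, k_min, p):
    """log P(Bin(n, p) >= k_min), summed in log space to avoid underflow."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
        for k in range(k_min, n + 1)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))

def median_tail_rate(n, p_plus):
    """-(1/n) log P(sample median > c) for odd n, where p_plus = P(X_1 > c) < 1/2."""
    return -log_binomial_upper_tail(n, (n + 1) // 2, p_plus) / n

p_plus = 0.3
finite_n_rate = median_tail_rate(4001, p_plus)
limit_rate = -0.5 * math.log(4.0 * p_plus * (1.0 - p_plus))
print(finite_n_rate, limit_rate)
```

The finite-sample value approaches the limit slowly, with a correction of order $(\log n)/n$, so a strictly positive rate is already visible at moderate $n$.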


Having established an expression for the inaccuracy rate of M-estimators (Theorems 3.2 and 3.3) and an upper bound for the inaccuracy rate of equivariant estimators (Lemma 3.1), the question arises naturally which, if any, of the M-estimators attains Sievers' bound. In view of (3.2) Sievers' bound $C(-\varepsilon, \varepsilon)$ can be written as

    $-\log \inf_{0 \le \alpha \le 1} \int \exp\left\{\alpha \log \frac{p(x - \varepsilon)}{p(x + \varepsilon)}\right\} p(x + \varepsilon)\, dx.$

Define

(3.16)    $\psi_\varepsilon(x) = \log \frac{p(x - \varepsilon)}{p(x + \varepsilon)}.$

We assume that either $\psi_\varepsilon < \infty$ a.e. or $\psi_\varepsilon > -\infty$ a.e. If $\psi_\varepsilon > -\infty$ a.e. then $\inf_{\tau \le 0} \rho_\varepsilon(\tau) = \inf_{0 \le \tau \le 1} \rho_{-\varepsilon}(\tau)$ and hence $e_{\psi_\varepsilon}(\varepsilon) = C(-\varepsilon, \varepsilon)$. If $\psi_\varepsilon < \infty$ a.e. then $\inf_{\tau \ge 0} \rho_{-\varepsilon}(\tau) = \inf_{0 \le \tau \le 1} \rho_{-\varepsilon}(\tau)$ and hence $e_{\psi_\varepsilon}(\varepsilon) = C(-\varepsilon, \varepsilon)$. Therefore, as a rule, the M-estimator based on $\psi_\varepsilon$ given by (3.16) is inaccuracy rate optimal within the class of equivariant estimators. For instance, when $\psi_\varepsilon$ is nondecreasing and either $> -\infty$ a.e. or $< \infty$ a.e., indeed $\{T_n^{(\psi_\varepsilon)}\}$ attains Sievers' bound; cf. Sievers (1978), Theorem 2.1 and Fu (1985). Note that in general the optimal M-estimator will depend on $\varepsilon$.

For a nonmonotone $\psi_\varepsilon$ we cannot apply Theorem 3.3 directly, since in general $\varepsilon_0$ will depend on $\psi = \psi_\varepsilon$; so for a fixed $\varepsilon^*$, say, $\varepsilon_0^* = \varepsilon_0(\psi_{\varepsilon^*})$ may be smaller than $\varepsilon^*$. Nevertheless the next theorem states that if $p$ is sufficiently smooth (3.9) holds for $\psi = \psi_\varepsilon$ when $\varepsilon$ is small enough, with the important implication that Sievers' bound is attainable in these situations. Even for the Cauchy shift family an inaccuracy rate optimal M-estimator is obtained by this result; cf. Example 3.2.


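THEOREM 3.4. Let $p > 0$ on $\mathbb{R}$ with $\int_{-\infty}^{0} p\, dx = \tfrac12$. If $p$ is three times differentiable such that $p'(x) > 0$ $(< 0)$ for each small (large) enough $x$, such that the first three derivatives of $\log p$ are bounded and such that

    $\int (\log p)''\, p\, dx < 0,$

then

(3.17)    $e(\varepsilon, \{T_n^{(\psi_\varepsilon)}\}) = e_{\psi_\varepsilon}(\varepsilon) = C(-\varepsilon, \varepsilon)$

when $\varepsilon$ is small enough.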




PROOF. For e > 0 the function y, strongly depends on e. Therefore we define 
the “standardized” yp, = (2e)'y,; then obviously T{*? = T{*®. It is easily 
checked that y, satisfies (3.6). Further we have 


ez(e) = e,(e) = C(-e,e) 0 ase 0. 
Noting that 


lim [dp de = —f(log p)"pdx = c, > 0 


and 


liminf — inf log | e™*-%/2p dx 


e—0 TSO 


> — inf lim log fe". dx 


TsO e—0 
= — inf log | e™ (8 P)"— 4/2) dx > 0, 
r<0 


we can follow for 1, the same line of argument as in the proof of Theorem 3.3, 
using the mean value theorem to bound y4, ¥,, and 47. We omit further details. 
U 
EXAMPLE 3.2. Consider the Cauchy density $p(x) = \pi^{-1}(1 + x^2)^{-1}$. We have

    $\psi_\varepsilon(x) = \log \frac{1 + (x + \varepsilon)^2}{1 + (x - \varepsilon)^2}.$

The conditions of Theorem 3.4 hold, hence $\{T_n^{(\psi_\varepsilon)}\}$ attains Sievers' bound when $\varepsilon$ is sufficiently small.


It is seen in the proof of Lemma 2.4(ii) that $C(-\varepsilon, \varepsilon) = \max\{K(\gamma(\tilde\alpha), -\varepsilon), K(\gamma(\tilde\alpha), \varepsilon)\}$. It can be shown that in regular cases $e(\varepsilon, \{T_n\}) = C(-\varepsilon, \varepsilon)$ implies that the influence curve at $P_{\gamma(\tilde\alpha)}$ of the estimator $\{T_n\}$ is a.e. proportional to $\psi_\varepsilon$ given by (3.16), and hence $\{T_n^{(\psi_\varepsilon)}\}$ is the essentially unique M-estimator which attains Sievers' bound; cf. Theorem 4.7 in Kester (1985), page 76. However, an L-estimator may possibly also attain Sievers' bound. The next example shows that this occurs in the double exponential family, where a trimmed mean attains Sievers' bound.


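EXAMPLE 3.3. Let $p(x) = \tfrac12 e^{-|x|}$. Sievers' bound $C(-\varepsilon, \varepsilon) = \varepsilon - \log(1 + \varepsilon)$ is attained by the M-estimator $\{T_n^{(\psi_\varepsilon)}\}$ with $\psi_\varepsilon$ given by

    $\psi_\varepsilon(x) = -2\varepsilon$ when $x < -\varepsilon$,
    $\psi_\varepsilon(x) = 2x$ when $|x| \le \varepsilon$,
    $\psi_\varepsilon(x) = 2\varepsilon$ when $x > \varepsilon$;

cf. Sievers (1978). The probability measure $P_{\gamma(\tilde\alpha)}$ is given by

    $\frac{dP_{\gamma(\tilde\alpha)}}{dx}(x) = \tfrac12 (1 + \varepsilon)^{-1} \exp\{-|x| + \varepsilon\}$ when $|x| > \varepsilon$,
    $\frac{dP_{\gamma(\tilde\alpha)}}{dx}(x) = \tfrac12 (1 + \varepsilon)^{-1}$ when $|x| \le \varepsilon$.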




The influence curve at $P_{\gamma(\tilde\alpha)}$ of the $\tfrac12(1 + \varepsilon)^{-1}$-trimmed mean, $L_n$ say, is a.e. proportional to the influence curve of $\{T_n^{(\psi_\varepsilon)}\}$. So this L-estimator is a candidate for being optimal. By symmetry the inaccuracy rate of $\{L_n\}$ equals

    $-\lim_{n \to \infty} n^{-1} \log P_0(L_n > \varepsilon).$

In view of Theorem 6.3 and formula (6.13) of Groeneboom et al. (1979) this can be expressed as (writing $\alpha = \tfrac12(1 + \varepsilon)^{-1}$)

(3.18)    $2\alpha \log \alpha + (1 - 2\alpha) \log(1 - 2\alpha) + \inf\{\sup_{t \ge 0} f(a, b, t) : -\infty \le a < b \le \infty,\ b > \varepsilon\}$
with 
; : 
a,bo,t)=(1-— 2a [te -10 e*n(x) dx 
Gn f(a, b, t) = (1 — 2a)|te ~ log | e”p(x) de 


—afllog F(a) + log(1 — F(b))]. 
where F denotes the distribution function of P,. We shall show that /(0, 2¢,1) 
attains the sup and inf in (3.18). Consider 
f(a, 6,1) — f(0,2e,1) = (1 — 2a)| -log [pe de 4: log f°"} a 
a 0 


—allog F(a) — log + + log(1 — F(®)) — log 4e~**]. 
Multiplying by (1 + e)/e = (1 — 2a)~' we obtain, when a < 0 and b> e, 


er 


2 


= -o| e | e —[a- TA 


2e 2 





1 
log e — log E | 





1 
> —[e**- 2a — 1] 20, 
4e 


where the inequality log x < x — 1 was used. When a > 0 we find in a similar 
way that 
f(a,b,1) 2 f(O, 2e, 1). 
It follows that 
inf sup f(a, b, t) = f(0,2e,1) 
a,b t>0 


and it suffices to remark that f(0,2e,1) > f(0,2e,t) by symmetry and convexity 
of 
2e 
t> | ef- dy, 
j, 

It remains to evaluate $f(0, 2\varepsilon, 1)$. Together with the other part of (3.18) this indeed equals $\varepsilon - \log(1 + \varepsilon)$. So the $\tfrac12(1 + \varepsilon)^{-1}$-trimmed mean attains Sievers' bound.




4. Tail-behaviour of location estimators. The asymptotic behaviour as $\varepsilon \to \infty$ of

(4.1)    $B(\varepsilon, T_n) = \frac{-\log P_0(|T_n| > \varepsilon)}{-\log P_0(|X_1| > \varepsilon)}$

is proposed by Jurečková (1979, 1981) as a basis for comparison of translation equivariant estimators $T_n$ in location families $\{p_\theta(x) = p(x - \theta) : x, \theta \in \mathbb{R}\}$. For symmetric $p$ and translation equivariant estimators $T_n$ satisfying

    $\min\{X_i : 1 \le i \le n\} > 0 \Rightarrow T_n(X_1, \ldots, X_n) > 0,$
    $\max\{X_i : 1 \le i \le n\} < 0 \Rightarrow T_n(X_1, \ldots, X_n) < 0,$

it holds that

(4.2)    $1 \le \liminf_{\varepsilon \to \infty} B(\varepsilon, T_n) \le \limsup_{\varepsilon \to \infty} B(\varepsilon, T_n) \le n;$

cf. Jurečková (1979, 1981). Moreover, if

(4.3)    $\lim_{\varepsilon \to \infty} \frac{-\log P_0((\varepsilon, \infty))}{b\varepsilon^r} = 1$

for some $b > 0$, $r > 1$, then the sample mean attains the upper bound in (4.2), i.e.,

(4.4)    $\lim_{\varepsilon \to \infty} B(\varepsilon, \bar X_n) = n;$

cf. Jurečková (1979). In this section we connect the results of Jurečková and the inaccuracy rates as treated in the previous section.

First consider, just as above, a fixed sample size $n$. Suppose that $\psi_\varepsilon(x) = \log\{p(x - \varepsilon)/p(x + \varepsilon)\}$ is nondecreasing; then $T_n^{(\psi_\varepsilon)}$ minimizes $P_0(|T_n| > \varepsilon)$ over the class of translation equivariant estimators; cf. Huber (1968). We prove that under a condition similar to (4.3) the optimal estimator $T_n^{(\psi_\varepsilon)}$ converges pointwise to $\bar X_n$ as $\varepsilon \to \infty$.


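THEOREM 4.1. If $p$ is log concave and satisfies

(4.5)    $-\log p(x) = b|x|^r (1 + f(x))$

with $b > 0$, $r > 1$, $f(x) \to 0$ as $|x| \to \infty$ and for each $x$

(4.6)    $f(x - \varepsilon) = f(x + \varepsilon) + o(\varepsilon^{-1})$ as $\varepsilon \to \infty$,

then

    $\lim_{\varepsilon \to \infty} T_n^{(\psi_\varepsilon)}(x_1, \ldots, x_n) = n^{-1} \sum_{i=1}^{n} x_i$ for each $(x_1, \ldots, x_n) \in \mathbb{R}^n$.


The proof hinges on the following lemma.


LEMMA 4.2. Under the conditions (4.5) and (4.6)

    $\psi_\varepsilon(x) = 2br\varepsilon^{r-1} x (1 + o(1))$ as $\varepsilon \to \infty$.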




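PROOF. Fix $x \in \mathbb{R}$. As $\varepsilon \to \infty$ we obtain

    $\psi_\varepsilon(x) = -b(\varepsilon - x)^r (1 + f(x + \varepsilon) + o(\varepsilon^{-1})) + b(\varepsilon + x)^r (1 + f(x + \varepsilon))$
    $= b\{(\varepsilon + x)^r - (\varepsilon - x)^r\}(1 + o(1)) + o(\varepsilon^{r-1}) = 2brx\varepsilon^{r-1}(1 + o(1)).$ □


PROOF OF THEOREM 4.1. Fix $\delta > 0$ and $(x_1, \ldots, x_n) \in \mathbb{R}^n$. By Lemma 4.2 we have, writing $\bar x = n^{-1} \sum_{i=1}^{n} x_i$,

    $\sum_{i=1}^{n} \psi_\varepsilon(x_i - \bar x - \delta) = 2br\varepsilon^{r-1} \sum_{i=1}^{n} (x_i - \bar x - \delta)(1 + o(1)) = -2br\varepsilon^{r-1} n\delta (1 + o(1)) < 0$

and similarly $\sum_{i=1}^{n} \psi_\varepsilon(x_i - \bar x + \delta) > 0$ when $\varepsilon$ is sufficiently large, implying $|T_n^{(\psi_\varepsilon)}(x_1, \ldots, x_n) - \bar x| < \delta$. □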
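The convergence in Theorem 4.1 can be observed numerically. The sketch below assumes the density $p(x) \propto \exp(-|x|^3)$, a hypothetical choice that is log concave and corresponds to $b = 1$, $r = 3$ in (4.5), so that $\psi_\varepsilon(x) = |x + \varepsilon|^3 - |x - \varepsilon|^3$. It locates the zero of $t \mapsto \sum_i \psi_\varepsilon(x_i - t)$ by bisection and compares it with the sample mean; the function names and the sample are illustrative only.

```python
def psi_eps(x, eps, r=3.0):
    """psi_eps(x) = log p(x - eps) - log p(x + eps) for p proportional to exp(-|x|**r)."""
    return abs(x + eps) ** r - abs(x - eps) ** r

def m_estimate(sample, eps, iterations=200):
    """Bisection for the zero of t -> sum psi_eps(x_i - t), which is nonincreasing in t."""
    lo, hi = min(sample), max(sample)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(psi_eps(x - mid, eps) for x in sample) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample = [0.0, 1.0, 5.0]
mean = sum(sample) / len(sample)
for eps in (1.0, 10.0, 1000.0):
    print(eps, m_estimate(sample, eps), mean)

# Lemma 4.2 for this density: psi_eps(x) / (2 * b * r * eps**(r - 1) * x) tends to 1.
ratio = psi_eps(2.0, 1000.0) / (6.0 * 1000.0 ** 2 * 2.0)
```

For small $\varepsilon$ the estimate is pulled away from the mean by the skewness of the sample, and the gap shrinks as $\varepsilon$ grows.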


Next consider the double exponential distribution. For each $0 < \varepsilon < \infty$ the $\tfrac12(1 + \varepsilon)^{-1}$-trimmed mean maximizes, among translation equivariant estimators, the inaccuracy rate $-\liminf_{n \to \infty} n^{-1} \log P_0(|T_n| > \varepsilon)$; cf. Example 3.3. This estimator converges to the sample mean if $\varepsilon \to \infty$ and to the sample median if $\varepsilon \to 0$, thus providing a bridge between gross error optimality and strictly local optimality.


Acknowledgment. The authors are much indebted to J. Oosterhoff for his 
stimulating advice and his continuous interest during the preparation of this 


paper. 


REFERENCES 


BAHADUR, R. R. (1980). On large deviations of maximum likelihood and related estimates. Technical Report No. 121, Univ. of Chicago.

BAHADUR, R. R. (1983). Large deviations of the maximum likelihood estimate in the Markov chain case. In Recent Advances in Statistics (M. H. Rizvi, J. S. Rustagi and D. Siegmund, eds.) 273-286. Academic, New York.

BAHADUR, R. R., GUPTA, J. C. and ZABELL, S. L. (1980). Large deviations, tests and estimates. In Asymptotic Theory of Statistical Tests and Estimation (I. M. Chakravarti, ed.) 33-64. Academic, New York.

BERK, R. H. (1972). Consistency and asymptotic normality of MLE's for exponential models. Ann. Math. Statist. 43 193-204.

CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on sums of observations. Ann. Math. Statist. 23 493-507.

EFRON, B. (1978). The geometry of exponential families. Ann. Statist. 6 362-376.

EFRON, B. and TRUAX, D. (1968). Large deviations theory in exponential families. Ann. Math. Statist. 39 1402-1424.

FU, J. C. (1982). Large sample point estimation: A large deviation theory approach. Ann. Statist. 10 762-771.

FU, J. C. (1985). On exponential rates of likelihood ratio estimators for location parameters. Statist. Probab. Lett. 3 101-105.

GROENEBOOM, P., OOSTERHOFF, J. and RUYMGAART, F. H. (1979). Large deviation theorems for empirical probability measures. Ann. Probab. 7 553-586.




HUBER, P. J. (1968). Robust confidence limits. Z. Wahrsch. verw. Gebiete 10 269-278.

JUREČKOVÁ, J. (1979). Finite-sample comparison of L-estimators of location. Comment. Math. Univ. Carolin. 20 509-618.

JUREČKOVÁ, J. (1981). Tail-behavior of location estimators. Ann. Statist. 9 578-585.

KESTER, A. D. M. (1985). Some large deviation results in statistics. CWI Tract 18. Mathematisch Centrum, Amsterdam.

KOUROUKLIS, S. (1984). A large deviation result for the likelihood ratio statistic in exponential families. Ann. Statist. 12 1510-1521.

PFANZAGL, J. (1976). Investigating the quantile of an unknown distribution. In Contributions to Statistics (W. J. Ziegler, ed.) 111-126. Birkhäuser, Basel.

PINSKER, M. S. (1964). Information and Information Stability of Random Variables and Processes. Holden-Day, San Francisco.

RUBIN, H. and RUKHIN, A. L. (1983). Convergence rates of large deviations probabilities for point estimators. Statist. Probab. Lett. 1 197-202.

SIEVERS, G. L. (1978). Estimates of location: A large deviation comparison. Ann. Statist. 6 610-618.


DEPARTMENT OF MEDICAL INFORMATICS AND STATISTICS
UNIVERSITY OF LIMBURG
P.O. Box 616
6200 MD MAASTRICHT
THE NETHERLANDS

DEPARTMENT OF APPLIED MATHEMATICS
TWENTE UNIVERSITY OF TECHNOLOGY
P.O. Box 217
7500 AE ENSCHEDE
THE NETHERLANDS


The Annals of Statistics
1986, Vol. 14, No. 2, 665-678


THE STATISTICAL INFORMATION CONTAINED IN 
ADDITIONAL OBSERVATIONS! 


By ENNO MAMMEN 
Universität Heidelberg 


Let $\mathcal{E}^n$ be a statistical experiment based on $n$ i.i.d. observations. We compare $\mathcal{E}^n$ with $\mathcal{E}^{n+r_n}$. The gain of information due to the $r_n$ additional observations is measured by the deficiency distance $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n})$, i.e., the maximum diminution of the risk functions. We show that under general dimensionality conditions $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n})$ is of order $r_n/n$. Further the behavior of $\Delta$ is studied and compared for asymptotically Gaussian experiments. We show that the information gain increases logarithmically. The Gaussian and the binomial family turn out to be, in some sense, opposite extreme cases, with the increase of information asymptotically minimal in the Gaussian case and maximal in the binomial.


1. Introduction. When considering a complicated statistical model it may be useful to construct another model which is close to the original one but statistically easier to handle. The analysis of the second model may make the essential structure of the first model easier to understand and help to construct suitable statistical procedures for a decision problem. The usual way to get such an approximating model is to imbed the original one into a sequence of models and to expand the log-likelihood function. Because one is more interested in approximations than in limit theorems it is necessary to estimate the closeness of the two models. A natural quantity for comparing two models (or, in more common use of language, two experiments) is the deficiency distance of Le Cam (1964). It is based on the comparison of risk functions available in the two experiments. We recall its definition.

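Let $\mathcal{E} := (\mathcal{X}, \mathcal{A}, (P_\theta : \theta \in \Theta))$ and $\mathcal{F} := (\mathcal{Y}, \mathcal{B}, (Q_\theta : \theta \in \Theta))$ be two experiments with the same parameter set $\Theta$, i.e., two families of probability measures $(P_\theta : \theta \in \Theta)$ and $(Q_\theta : \theta \in \Theta)$ defined on measurable spaces $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$, respectively. $\mathcal{E}$ is called $\varepsilon$-deficient relative to $\mathcal{F}$ ($\varepsilon \ge 0$), if for every finite decision space $(T, \mathcal{S})$, for every bounded loss function $L : \Theta \times T \to \mathbb{R}$ and for every decision rule $\sigma$ in $\mathcal{F}$ there exists a decision rule $\rho$ in $\mathcal{E}$ such that for every $\theta \in \Theta$ the following inequality between the risk functions is valid:

(1.1)    $\int\!\!\int L(\theta, t)\, \rho(x, dt)\, P_\theta(dx) \le \int\!\!\int L(\theta, t)\, \sigma(x, dt)\, Q_\theta(dx) + \varepsilon \sup_{\theta, t} |L(\theta, t)|.$


The deficiency $\delta(\mathcal{E}, \mathcal{F})$ of $\mathcal{E}$ with respect to $\mathcal{F}$ is the smallest $\varepsilon \ge 0$ for which $\mathcal{E}$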

Received July 1984; revised July 1986. 

'This work has been supported by the Deutsche Forschungsgemeinschaft. 

AMS 1980 subject classification. Primary 62B15. 

Key words and phrases. Experiments, deficiency, information, additional observations. 




is $\varepsilon$-deficient with respect to $\mathcal{F}$. The deficiency distance is the symmetrical quantity $\Delta(\mathcal{E}, \mathcal{F}) = \delta(\mathcal{E}, \mathcal{F}) \vee \delta(\mathcal{F}, \mathcal{E})$. It defines a pseudo-distance between experiments. We cite two other characterizations of the deficiency. For a detailed motivation and discussion see Le Cam (1964).


(i) The randomisation criterion.

(1.2)    $\delta(\mathcal{E}, \mathcal{F}) = \inf_K \sup_{\theta \in \Theta} \|KP_\theta - Q_\theta\|,$

where the infimum is taken over all transitions which map the band $L(\mathcal{E})$, generated by $(P_\theta : \theta \in \Theta)$, into the band $L(\mathcal{F})$, generated by $(Q_\theta : \theta \in \Theta)$. A transition $K$ is a positive norm one linear map [i.e., $K\mu^+ \ge 0$, $\|K\mu^+\| = \|\mu^+\|$, $K(a\mu + b\nu) = aK\mu + bK\nu$ for $\mu, \nu \in L(\mathcal{E})$ and $a, b \in \mathbb{R}$]. If $(P_\theta : \theta \in \Theta)$ is dominated, $\mathcal{Y}$ is a Borel subset of a complete separable metric space and $\mathcal{B}$ is the class of Borel subsets of $\mathcal{Y}$, then it is sufficient to take the infimum in (1.2) over all Markov kernels from $(\mathcal{X}, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{B})$.


(ii) The Bayes criterion.

(1.3)    $\delta(\mathcal{E}, \mathcal{F}) = \sup\{\rho_{\mathcal{E}}(\pi, L, D) - \rho_{\mathcal{F}}(\pi, L, D)\},$

where the supremum is taken over all prior measures $\pi$ with finite support, finite decision spaces $D$, and loss functions $L$ bounded in absolute value by 1. $\rho_{\mathcal{E}}(\pi, L, D)$ resp. $\rho_{\mathcal{F}}(\pi, L, D)$ is the corresponding Bayes risk in $\mathcal{E}$ resp. $\mathcal{F}$.


Unfortunately, the deficiency distance is very difficult to calculate in general. 
But for translation experiments it suffices to take the infimum in (1.2) over 
invariant kernels. This can be used to calculate the deficiency explicitly in some 
cases. For instance, Torgersen (1972) showed that 


1 
A(&z, ptr) = nxt} — (n a at aad dx 
(1.4) RI ÔR f | | 
= (2/e)(r/n) + o(1/n) = 0.73(r/n) 
if &” is the experiment of taking n iid. observations from a rectangular 
distribution on [0, @] for 8 > 0. He also showed that 


ACEL ent” = 2 - all 
(1.5) ( E E ) iix + X „t+r/ 


= /2/ne(r/n) + o(1/n) = 0.48(r/n) 


if &7 is the experiment of observing n times an exponentially distributed 
variable. x? ,,,/, denotes the distribution of (1 + r/n)X, if X is distributed 
according to x*. Another example is 


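(1.6)    $\Delta(\mathcal{G}^n, \mathcal{G}^{n+r}) = \|N(0, I_k/n) - N(0, I_k/(n + r))\|$


for the Gaussian shift experiment $\mathcal{G} = (N(\theta, \Sigma) : \theta \in \mathbb{R}^k)$. $I_k$ is the $k \times k$ identity matrix and $\Sigma$ a positive definite $k \times k$ matrix.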




For $k = 1$ (1.6) yields

(1.7)    $\Delta(\mathcal{G}^n, \mathcal{G}^{n+r}) = \sqrt{2/(\pi e)}\; r/n + o(1/n).$
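The leading term in (1.7) can be checked directly for $k = 1$. The sketch below assumes that $\|\cdot\|$ in (1.6) is the total variation norm, i.e., the $L_1$ distance between the two densities; this reading is an assumption, though it is consistent with the constant $\sqrt{2/(\pi e)} \approx 0.48$. The densities of $N(0, 1/n)$ and $N(0, 1/(n+r))$ cross at $\pm x_0$ with $x_0^2 = \log(1 + r/n)/r$, so the distance reduces to a difference of two error-function values. The function name is illustrative.

```python
import math

def gaussian_deficiency_distance(n, r):
    """L1 distance between N(0, 1/n) and N(0, 1/(n + r)) via the density crossing points."""
    x0 = math.sqrt(math.log(1.0 + r / n) / r)   # the two densities are equal at +-x0
    inner_gap = (math.erf(x0 * math.sqrt((n + r) / 2.0))
                 - math.erf(x0 * math.sqrt(n / 2.0)))
    return 2.0 * inner_gap

leading_constant = math.sqrt(2.0 / (math.pi * math.e))
for n in (10, 100, 10000):
    print(n, n * gaussian_deficiency_distance(n, 1), leading_constant)
```

Multiplying the distance by $n/r$ and letting $n$ grow shows the convergence to the constant in (1.7).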


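The results in Le Cam (1964) and Torgersen (1972) on invariance are also used in Swensen (1980) to compute deficiencies between linear models.

For one-dimensional exponential families $\mathcal{E}$ Helgeland (1982) has calculated lower and upper asymptotic bounds for $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n})$:

(1.8)    $\sqrt{2/(\pi e)} \le \liminf_{n \to \infty} \frac{n}{r_n} \Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n}) \le \limsup_{n \to \infty} \frac{n}{r_n} \Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n}) \le 2\sqrt{2/(\pi e)},$

provided $r_n$ is bounded by a fixed power of $n$ with exponent less than 1.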

For finite parameter sets the behaviour of the products $\mathcal{E}^n$ of an experiment $\mathcal{E}$ has been studied by Torgersen (1981). Using the deficiency he compares $\mathcal{E}^n$ with the totally informative experiment and the least informative experiment. In the context of robust statistics the deficiency distance can be used to measure how much the assumed model differs from the true model. Miller (1980/81) gives an estimate of the deficiency distance between two different models, in terms of the bounded Lipschitz distance between the probability measures. This estimate is of the correct order as the Lipschitz distance tends to zero.

Some of our examples deal with measuring the distance between an experiment $\mathcal{E} = (\mathcal{X}, \mathcal{A}, (P_\theta : \theta \in \Theta))$ and some subexperiment $\mathcal{E}' = (\mathcal{X}, \mathcal{A}', (P_\theta|_{\mathcal{A}'} : \theta \in \Theta))$ (i.e., $\mathcal{A}' \subset \mathcal{A}$). A quantity $\eta(\mathcal{E}', \mathcal{E})$ called insufficiency has been introduced for this situation by Le Cam (1974). The insufficiency seems to be more tractable than the deficiency. It measures how much the $P_\theta$'s must be modified in order for $\mathcal{A}'$ to be sufficient. Under certain regularity conditions on the experiments $\mathcal{E}$, $\mathcal{E}'$ the insufficiency $\eta(\mathcal{E}', \mathcal{E})$ can be defined by

    $\eta(\mathcal{E}', \mathcal{E}) = \inf \sup_{\theta \in \Theta} \|P_\theta - P_\theta'\|,$

where the infimum is taken over all families of measures $(P_\theta' : \theta \in \Theta)$ for which $\mathcal{A}'$ is sufficient and $P_\theta$ and $P_\theta'$ agree on $\mathcal{A}'$. For a general definition see Le Cam (1974).

The notion of insufficiency can be used to measure the information contained in additional observations. Le Cam (1974) gives the following general estimate:


(1.9) n(€",€"*") < (8B yr/n 


with £ := infgsupy. grE,H*(P,, Py). Here the infimum has to be taken over all 
randomised estimates § of @ in the experiment £”, and H(-,-+) is the Hellinger 
distance, i.e., H*(P,Q):= f(YdP — dQ)? for two probability measures. Using a 
dimensionality condition Le Cam has shown that £ is bounded [see Birgé (1983)]: 


(1.10) B<65D +55 




with a dimensionality constant $D$ defined as follows:

(1.11)    Consider $h(\theta, \tau) = H(P_\theta, P_\tau)$ as a pseudo-distance on $\Theta$. Then $D$ is the smallest number such that, for every $\delta > 0$, every subset of $\Theta$ with diameter $\delta$ can be covered by $2^D$ sets of diameter $\delta/2$.


(1.9) and (1.10) yield $\eta(\mathcal{E}^n, \mathcal{E}^{n+1}) = O(1/\sqrt{n})$ for finite dimensional experiments. For $k$-dimensional Gaussian experiments $\mathcal{G}$ this is the right order [Le Cam (1974)]. Because of $\eta(\mathcal{E}, \mathcal{F}) \ge \delta(\mathcal{E}, \mathcal{F})$ the insufficiency can be used to calculate upper bounds for the deficiency. For instance one gets $\delta(\mathcal{E}^n, \mathcal{E}^{n+1}) = O(1/\sqrt{n})$ for finite dimensional $\mathcal{E}$.

As in most of the examples given, in this paper we are concerned only with the calculation of the deficiency for the special case of comparing an experiment for a different number of observations. First we will show that $\delta(\mathcal{E}^n, \mathcal{E}^{n+1}) = O(1/n)$ for finite dimensional $\mathcal{E}$. This improves the above mentioned result, which was based on the calculation of the insufficiency. In the rest of the paper the increase of information $\delta(\mathcal{E}^n, \mathcal{E}^{n+1})$ will be studied for asymptotically Gaussian experiments. Then $\delta(\mathcal{E}^n, \mathcal{E}^{\alpha n})$ converges for $n \to \infty$ to a constant depending only on $\alpha$. Thus the information increases, as it were, logarithmically. For one additional observation the increase of information turns out to be asymptotically minimal for Gaussian experiments and, among exponential families, maximal for a binomial family. Further investigations of $\delta(\mathcal{E}^n, \mathcal{E}^{n+r})$ for exponential families can be found in Mammen (1983).

The following section formally presents our results and mentions some of the 
main elements of their proofs. Detailed proofs of these results are contained in 
Section 3. 


2. Results. To prove his upper bounds in (1.8) Helgeland (1982) made use of
the randomisation criterion (1.2). He constructed a kernel as follows. First
estimate the parameter $\theta$, then generate a random variable distributed according
to the estimated measure and mix this variable randomly among the observations
drawn first. This idea can also be used in the more general situation when in an
experiment $\mathcal{E} = (\mathcal{X}, \mathcal{A}, (P_\theta: \theta \in \Theta))$ estimators exist which are $\sqrt{n}$-consistent in
the following sense.


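(2.1)  There exist positive constants $\gamma$ and $B$ such that for every $n$ there
       is an estimate $\hat\theta_n$ in $\mathcal{E}^n$ with (i) $E_\theta \exp(\gamma n h^2(\theta, \hat\theta_n)) \le B$ and (ii)
       $x \to P_{\hat\theta_n(x)}(A)$ measurable for $A \in \mathcal{A}$.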


Here $h(\theta, \tau) := H(P_\theta, P_\tau)$ is the pseudo-distance on $\Theta$ induced by the Hellinger
distance.
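As an illustration of the pseudo-distance induced by the Hellinger distance, the following Python sketch evaluates it for distributions on a finite set. It assumes the unnormalised convention, i.e. the squared distance is the sum of squared differences of the square-root probabilities with no factor 1/2, which is how the definition above reads. The function names `hellinger_sq`, `bernoulli` and `h` are illustrative only and do not come from the paper.

```python
import math


def hellinger_sq(p, q):
    """Squared Hellinger distance sum_i (sqrt(p_i) - sqrt(q_i))^2.

    Assumption: unnormalised convention (no factor 1/2), so the value
    lies between 0 and 2 for probability vectors p and q.
    """
    if len(p) != len(q):
        raise ValueError("distributions must be given on the same finite set")
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def bernoulli(p):
    """Probability vector of a Bernoulli(p) distribution on {0, 1}."""
    return [1.0 - p, p]


def h(theta, tau):
    """Pseudo-distance on the parameter set for the Bernoulli family."""
    return math.sqrt(hellinger_sq(bernoulli(theta), bernoulli(tau)))


if __name__ == "__main__":
    for tau in (0.5, 0.55, 0.6, 0.9):
        print(tau, h(0.5, tau))
```

For nearby parameters the distance behaves like a constant times the parameter difference, which is why an exponential moment bound on $n h^2$ as in (2.1) expresses $\sqrt{n}$-consistency.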


THEOREM 1. Let $\mathcal{E}$ be an experiment satisfying (2.1). Then there exists a
constant $C$ such that

(2.2)  $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r}) \le C r/n.$




In particular, if $\mathcal{E}$ is an experiment which is finite dimensional in the sense of
(1.11) then (2.1) and therefore Theorem 1 holds. Then the constant $C$ depends
only on the dimension $D$. This can be seen from results of Birgé, who proves (2.1)
for finite dimensional experiments [see Dacunha-Castelle (1978) and Birgé (1983),
where slightly different dimensionality conditions are used].

As a first step for calculating limit experiments Le Cam (1968) associates
products of experiments $\mathcal{E}^n$ with Poisson experiments $\mathcal{P}^n$. The Poisson
experiment is defined according to the following rule: first observe a Poisson variable $N$
with mean value $n$. Then carry out the experiment $\mathcal{E}^N$. When applied to this
situation Theorem 1 yields the following estimate:


COROLLARY. For experiments $\mathcal{E}$ fulfilling (2.1) one has
(2.3)  $\Delta(\mathcal{E}^n, \mathcal{P}^n) = O(1/\sqrt{n}).$


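Now we discuss the case when the products $\mathcal{E}^n$ of an experiment
$\mathcal{E} = (\mathcal{X}, \mathcal{A}, (P_\theta: \theta \in \Theta))$ can be approximated locally by a Gaussian experiment.
More precisely we assume that $\Theta \subset R^k$ and suppose that there is a $\theta_0 \in \Theta$ such
that the following condition holds.

(2.4)  There exists a positive-definite $k \times k$ matrix $\Sigma$ such that for all $c > 0$:
       $\Delta(\mathcal{E}^n_{n,c}, \mathcal{G}^n_{n,c}) \to 0$ for $n \to \infty$,
       where $\mathcal{E}_{n,c} = (\mathcal{X}, \mathcal{A}, (P_\theta: \|\theta - \theta_0\| \le c/\sqrt{n}))$ and
       $\mathcal{G}_{n,c} = (R^k, \mathcal{B}(R^k), (N(\theta, \Sigma): \|\theta - \theta_0\| \le c/\sqrt{n}))$.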


Here $\mathcal{B}(R^k)$ denotes the Borel $\sigma$-algebra of $R^k$. Sufficient conditions for (2.4) can
be found in Le Cam (1968). For instance the following holds: Assume $\mathcal{E} := (P_\theta: \theta \in \Theta)$
is an experiment with $\Theta \subset R^k$. Let $\theta_0$ be an element of the interior of $\Theta$,
such that for $\theta$ in a neighborhood of $\theta_0$, the measures $P_\theta$ can be dominated by a
finite measure $m$. Further assume that the function $\xi(\theta) = \sqrt{dP_\theta/dm}$ mapping $\Theta$
into $L^2(m)$ is Fréchet-differentiable at $\theta_0$ with derivative $\dot\xi(\theta_0) \in L^2(m)^k$, and
that the matrix $\Gamma = \int \dot\xi(\theta_0)\dot\xi(\theta_0)' \, dm$ is positive definite. Then (2.4) holds with
$\Sigma := \Gamma^{-1}$. For more general conditions for local Gaussian approximation see Le Cam
(1985).

Under further conditions local Gaussian approximations can be pieced
together to a global approximation of $\mathcal{E}^n$ by a heteroscedastic Gaussian
experiment. This was discussed in Le Cam (1975). For $\Theta \subset R^k$ we state such an
approximation.

(2.5)  There exist positive definite $k \times k$ matrices $\Gamma(\theta)$ depending
       continuously on $\theta$, such that $\Delta(\mathcal{E}^n, \mathcal{F}^n) \to 0$ for $n \to \infty$, where
       $\mathcal{F} = (R^k, \mathcal{B}(R^k), (N(\theta, \Gamma^{-1}(\theta)): \theta \in \Theta))$.


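It can be shown that
$\Delta(\mathcal{F}^n, \mathcal{F}^{n+r_n}) = \Delta(\mathcal{G}^n, \mathcal{G}^{n+r_n}) + o(1),$
where $\mathcal{G}$ is a $k$-dimensional homoscedastic Gaussian experiment. This proves the
following theorem.


THEOREM 2. Assume $\mathcal{E} = (\mathcal{X}, \mathcal{A}, (P_\theta: \theta \in \Theta))$ is an experiment with $\Theta \subset R^k$.
Assume further (2.5). Put $\mathcal{G} := (R^k, \mathcal{B}(R^k), (N(\theta, I_k): \theta \in R^k))$. Then for every
sequence of integers $(r_n)$ the following holds:

(2.6)  $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n}) = \Delta(\mathcal{G}^n, \mathcal{G}^{n+r_n}) + o(1).$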


The statement of Theorem 2 is interesting only in the case of a large number
of additional observations when $r_n$ is of order $n$. This is because (2.6) holds
trivially if $r_n = o(n)$, since then $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n}) = o(1)$ and $\Delta(\mathcal{G}^n, \mathcal{G}^{n+r_n}) = o(1)$ as
can be seen by (1.6) and Theorem 1 using the fact that $\mathcal{E}$ is finite-dimensional in
the sense of (1.11).

In the special case where the number of additional observations is proportional
to $n$ [$n + r_n = [an] = \sup\{k \in N: k \le an\}$ for a constant $a$] one obtains by (1.6)

(2.7)  $\Delta(\mathcal{E}^n, \mathcal{E}^{[an]}) = \Delta(\mathcal{G}^n, \mathcal{G}^{[an]}) + o(1)$
       $= \|N(0, aI_k) - N(0, I_k)\| + o(1).$


Here the increase of information depends asymptotically only on the dimension k 
and the constant a. Thus the information increases, as it were, logarithmically in 
the number of observations. 
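The Gaussian term $\|N(0, aI_k) - N(0, I_k)\|$ can be evaluated in closed form for $k = 1$. The Python sketch below assumes that $\|\cdot\|$ denotes the $L_1$ distance $\int |dP - dQ|$ (so that distances lie between 0 and 2); the two normal densities cross at $x^2 = a \ln a/(a-1)$, which gives the formula used. The function names are illustrative and not part of the paper.

```python
import math


def std_normal_cdf(x):
    """Distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def l1_dist_centered_normals(a):
    """L1 distance  int |f_a - f_1| dx  between N(0, a) and N(0, 1), a > 0.

    Assumption: the norm is the L1 (total variation) norm with values in [0, 2].
    """
    if a <= 0.0:
        raise ValueError("variance ratio a must be positive")
    if a == 1.0:
        return 0.0
    # The densities of N(0, 1) and N(0, a) are equal where x^2 = a*ln(a)/(a-1).
    c = math.sqrt(a * math.log(a) / (a - 1.0))
    return 4.0 * abs(std_normal_cdf(c) - std_normal_cdf(c / math.sqrt(a)))


def scaled_one_step_distance(n):
    """n * || N(0, 1 + 1/n) - N(0, 1) ||, the one-extra-observation scaling."""
    return n * l1_dist_centered_normals(1.0 + 1.0 / n)


if __name__ == "__main__":
    for a in (1.1, 2.0, 4.0):
        print(a, l1_dist_centered_normals(a))
    print(scaled_one_step_distance(1000), math.sqrt(2.0 / (math.pi * math.e)))
```

Under these assumptions the distance depends on the sample sizes only through the ratio $a$, and for $a$ close to 1 the quantity $(a-1)^{-1}\|N(0,a) - N(0,1)\|$ is close to $\sqrt{2/(\pi e)}$, about 0.484.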

Now we show that for a small number of additional observations the gain of 
information is asymptotically minimal in the case of Gaussian experiments. 


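THEOREM 3. Let $\mathcal{E} = (\mathcal{X}, \mathcal{A}, (P_\theta: \theta \in \Theta))$ be an experiment with $\Theta \subset R^k$.
Assume that $\mathcal{E}$ can be approximated locally by homoscedastic Gaussian
experiments in the sense of (2.4) at a point $\theta_0$ contained in the interior of $\Theta$. Then for
all sequences $(r_n)$ with $r_n = o(n)$ the following holds.

(2.8)  $\liminf_{n \to \infty} \frac{n}{r_n} \Delta(\mathcal{E}^n, \mathcal{E}^{n+r_n}) \ge \lim_{n \to \infty} \frac{n}{r_n} \Delta(\mathcal{G}^n, \mathcal{G}^{n+r_n}).$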


Theorem 3 is a generalization of the lower bound of Helgeland (1.8). The proof
consists of the following simple arguments: First, for $r$ fixed, $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r})$
decreases in $n$ (later observations are less informative). Asymptotically a large
number of additional observations in the experiment $\mathcal{E}$ is not less informative
than in a Gaussian experiment. Finally in Gaussian experiments the information
increases almost additively. Combining these arguments one obtains the
following asymptotic inequalities, where $l$ has to be chosen suitably depending on $n$:

$\frac{n}{r} \Delta(\mathcal{E}^n, \mathcal{E}^{n+r}) \ge \frac{n}{lr} \Delta(\mathcal{E}^n, \mathcal{E}^{n+lr})$
$\ge \frac{n}{lr} \Delta(\mathcal{G}^n, \mathcal{G}^{n+lr}) + o(1)$
$= \frac{n}{r} \Delta(\mathcal{G}^n, \mathcal{G}^{n+r}) + o(1).$




The next theorem shows that the upper bound in (1.8) is sharp and is attained by 
the binomial family. 


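THEOREM 4. Assume $0 < a < 1/2 < b < 1$. Let $\mathcal{E}_B = (\{0,1\}, 2^{\{0,1\}}, (Q_p: p \in (a, b)))$
be a Bernoulli experiment:

$Q_p(\{x\}) = px + (1-p)(1-x)$ for $x \in \{0,1\}$.
Then
(2.9)  $\Delta(\mathcal{E}_B^n, \mathcal{E}_B^{n+1}) = 2\Delta(\mathcal{G}^n, \mathcal{G}^{n+1}) + o(1/n).$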


Further investigations of $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r})$ for exponential families can be found in
Mammen (1983). There the inequality of Helgeland (1.8) is generalized to
arbitrary finite dimensional exponential families. As in the one-dimensional case the
increase of information is asymptotically at most twice as much as in the
Gaussian case. Furthermore, it turns out that one has to distinguish two cases. If
the measures of the exponential family are lattice distributions the gain of
information is asymptotically strictly larger than in the Gaussian case. For
strongly nonlattice distributions (i.e., measures fulfilling Cramér's condition) the
information gain increases exactly as in the Gaussian case, asymptotically as
$n \to \infty$. The proof of these results is based on Edgeworth expansions of Bayes
risks which hold uniformly over all Bayes decision problems with bounded loss
function.


3. Proofs. 


PROOF OF THEOREM 1. It suffices to prove Theorem 1 for $r = 1$. We
construct a Markov kernel $K$ from $(\mathcal{X}^n, \mathcal{A}^n)$ to $(\mathcal{X}^{n+1}, \mathcal{A}^{n+1})$. Let $m_n \le n$ be a
sequence of natural numbers with

(3.1)  $m_n/(n + 1 - m_n) \le \gamma,$
(3.2)  $n/m_n = O(1).$

According to (2.1) there exists an estimate $\hat\theta_n$ in $\mathcal{E}^n$ depending only on the first
$n + 1 - m_n$ observations with

(3.3)  $E_\theta \exp(\gamma(n + 1 - m_n) h^2(\theta, \hat\theta_n)) \le B.$
Using (3.1) one gets
(3.4)  $E_\theta \exp(m_n h^2(\theta, \hat\theta_n)) \le B.$


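The kernel $K$ is defined as follows:

(3.5)  $K = (1/m) \sum_{1 \le i \le m} \delta_{x_1} \times \cdots \times \delta_{x_{n-m+i}} \times P_{\hat\theta_n} \times \delta_{x_{n-m+i+1}} \times \cdots \times \delta_{x_n}.$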


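Here the index $n$ of $m$ is dropped. The randomization criterion (1.2) yields

(3.6)  $\Delta(\mathcal{E}^n, \mathcal{E}^{n+1}) \le \sup_{\theta \in \Theta} \|K P_\theta^n - P_\theta^{n+1}\|$
       $= \sup_{\theta \in \Theta} \int \|(1/m) \sum_{1 \le i \le m} P_\theta^{i-1} \times P_{\hat\theta_n} \times P_\theta^{m-i} - P_\theta^m\| \, dP_\theta^{n+1-m}.$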





To complete the proof we use the following lemma. 


LEMMA 1. For two probability measures $P$ and $Q$ the following holds:

(3.7)  $\|(1/m) \sum_{1 \le i \le m} P^{i-1} \times Q \times P^{m-i} - P^m\|^2$
       $\le \exp((m - 2)H^2(P, Q)) \, 4H^2(P, Q) [m^{-1} + H^2(P, Q)].$


For example it follows from Lemma 1 that given two sequences of measures
$(P_n)$ and $(Q_n)$ with $H(P_n, Q_n) = O(1/\sqrt{n})$, then
$\|(1/n) \sum_{1 \le i \le n} P_n^{i-1} \times Q_n \times P_n^{n-i} - P_n^n\|$ is of order $1/n$, whereas
$\|P_n^{n-1} \times Q_n - P_n^n\| = \|P_n - Q_n\|$ may be
eventually of order $1/\sqrt{n}$. Thus Lemma 1 underlines the importance of randomly
mixing the generated variable in the construction of $K$.
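The effect of mixing can be checked exactly for Bernoulli measures. The following Python sketch is only an illustration under stated assumptions: $P$ and $Q$ are Bernoulli($p$) and Bernoulli($q$), the norm is the $L_1$ distance, and the Hellinger distance is unnormalised. It uses the fact that the density of the mixture with respect to $P^m$ depends on the sample only through the number of ones. All function names are hypothetical.

```python
import math


def hellinger_sq_bernoulli(p, q):
    """Unnormalised squared Hellinger distance between Bernoulli(p) and Bernoulli(q)."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2


def l1_unmixed(p, q):
    """|| P^(m-1) x Q - P^m || = || P - Q || (the new variable is put in a fixed place)."""
    return 2.0 * abs(p - q)


def l1_mixed(p, q, m):
    """|| (1/m) sum_i P^(i-1) x Q x P^(m-i) - P^m ||, computed exactly.

    With k ones in the sample, the density of the mixture relative to P^m is
    (k * q/p + (m - k) * (1-q)/(1-p)) / m.
    """
    r1, r0 = q / p, (1.0 - q) / (1.0 - p)
    total = 0.0
    for k in range(m + 1):
        log_pmf = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                   + k * math.log(p) + (m - k) * math.log(1.0 - p))
        ratio = (k * r1 + (m - k) * r0) / m
        total += math.exp(log_pmf) * abs(ratio - 1.0)
    return total


def lemma1_bound(p, q, m):
    """Square root of the right-hand side of (3.7)."""
    h2 = hellinger_sq_bernoulli(p, q)
    return math.sqrt(math.exp((m - 2) * h2) * 4.0 * h2 * (1.0 / m + h2))


if __name__ == "__main__":
    for m in (25, 100, 400):
        q = 0.5 + 0.5 / math.sqrt(m)  # Hellinger distance of order 1/sqrt(m)
        print(m, l1_unmixed(0.5, q), l1_mixed(0.5, q, m), lemma1_bound(0.5, q, m))
```

In this example the unmixed distance decays like $1/\sqrt{m}$ while the mixed distance decays like $1/m$ and stays below the bound of Lemma 1.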








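APPLICATION OF LEMMA 1. Using Lemma 1 one gets from (3.6)

$\Delta(\mathcal{E}^n, \mathcal{E}^{n+1}) \le \sup_{\theta \in \Theta} E_\theta \exp\left(\frac{m-2}{2} h^2(\theta, \hat\theta_n)\right) \left(4h^2(\theta, \hat\theta_n)/m + 4h^4(\theta, \hat\theta_n)\right)^{1/2}.$

An application of the Cauchy-Schwarz inequality, (3.4), and (3.2) finishes the
proof.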


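PROOF OF LEMMA 1. Set $M := (P + Q)/2$, $g(x) = dP/dM(x)$, and $h(x) =
2 - g(x) = dQ/dM(x)$. The following holds:

$\|(1/m) \sum_{1 \le i \le m} P^{i-1} \times Q \times P^{m-i} - P^m\|^2$
$= \left[\int \left|(1/m) \sum_{1 \le i \le m} g(x_1) \cdots g(x_{i-1}) (h(x_i) - g(x_i)) g(x_{i+1}) \cdots g(x_m)\right| \prod_{1 \le i \le m} M(dx_i)\right]^2$
$\le \int \left((1/m) \sum_{1 \le i \le m} g(x_1) \cdots g(x_{i-1}) (h(x_i) - g(x_i)) g(x_{i+1}) \cdots g(x_m)\right)^2 \prod_{1 \le i \le m} M(dx_i)$
$= (1/m) \int (h - g)^2 \, dM \left[\int g^2 \, dM\right]^{m-1} + \frac{m-1}{m} \left(\int (h - g) g \, dM\right)^2 \left[\int g^2 \, dM\right]^{m-2}.$

The following estimates yield the lemma:

(3.8)  $\int (h - g)^2 \, dM = 2 \int (dP - dQ)^2/(dP + dQ)$
       $= 2 \int (\sqrt{dP} - \sqrt{dQ})^2 (\sqrt{dP} + \sqrt{dQ})^2/(dP + dQ)$
       $\le 4H^2(P, Q),$

(3.9)  $\left|\int (h - g) g \, dM\right| = \left|\int (h - g) g \, dM - \tfrac{1}{2} \int (h - g)(h + g) \, dM\right|$
       $= \tfrac{1}{2} \int (h - g)^2 \, dM$
       $\le 2H^2(P, Q),$

(3.10) $\int g^2 \, dM = \tfrac{1}{4} \int ((g - h) + (g + h))^2 \, dM$
       $= \tfrac{1}{4} \int (g - h)^2 \, dM + 1$
       $\le 1 + H^2(P, Q),$

       $\left(\int g^2 \, dM\right)^m \le (1 + H^2(P, Q))^m \le \exp(mH^2(P, Q)).$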
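PROOF OF THE COROLLARY. Assume $N$ is a Poisson variable with $EN = n$.
The following holds:

$\Delta(\mathcal{E}^n, \mathcal{P}^n) \le \delta(\mathcal{E}^n, \mathcal{P}^n) + \delta(\mathcal{P}^n, \mathcal{E}^n)$
$\le \sum_{k \ge 0} P(N = k) \Delta(\mathcal{E}^n, \mathcal{E}^k)$
$\le 2P(N < n/2) + \sum_{k \ge n/2} P(N = k) \Delta(\mathcal{E}^n, \mathcal{E}^k).$

Clearly, the first term is of order $o(1/\sqrt{n})$. To treat the second term we use the
following lemma.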


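LEMMA 2. For every experiment $\mathcal{E}$ and $r \in N$ the deficiency $\Delta(\mathcal{E}^n, \mathcal{E}^{n+r})$ is
monotonically decreasing in $n$.

APPLICATION OF LEMMA 2. One gets

$\sum_{k \ge n/2} P(N = k) \Delta(\mathcal{E}^n, \mathcal{E}^k) \le \sum_{k \ge n/2} P(N = k) |n - k| \Delta(\mathcal{E}^{[n/2]}, \mathcal{E}^{[n/2]+1})$
$\le E|N - n| \, O(1/n) = O(1/\sqrt{n}).$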




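PROOF OF LEMMA 2. Using the randomization criterion (1.2) one gets

$\Delta(\mathcal{E}^{n+1}, \mathcal{E}^{n+1+r}) = \delta(\mathcal{E}^{n+1}, \mathcal{E}^{n+1+r})$
$= \inf_K \sup_{\theta \in \Theta} \|K P_\theta^{n+1} - P_\theta^{n+1+r}\|$
$\le \inf_L \sup_{\theta \in \Theta} \|(L P_\theta^n) \times P_\theta - P_\theta^{n+r} \times P_\theta\|$
$= \inf_L \sup_{\theta \in \Theta} \|L P_\theta^n - P_\theta^{n+r}\|$
$= \Delta(\mathcal{E}^n, \mathcal{E}^{n+r}).$

Here the infimum is taken over all transitions which map the band $L(\mathcal{E}^{n+1})$ into
$L(\mathcal{E}^{n+1+r})$ or $L(\mathcal{E}^n)$ into $L(\mathcal{E}^{n+r})$, respectively.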


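PROOF OF THEOREM 2. It suffices to show
(3.12)  $\Delta(\mathcal{F}^n, \mathcal{F}^{n+r_n}) = \Delta(\mathcal{G}^n, \mathcal{G}^{n+r_n}) + o(1).$
Firstly the following holds:

(3.13)  $\Delta(\mathcal{F}^n, \mathcal{F}^{n+r_n}) = \inf_K \sup_{\theta \in \Theta} \|K N(\theta, n^{-1}\Gamma^{-1}(\theta)) - N(\theta, (n + r_n)^{-1}\Gamma^{-1}(\theta))\|$
        $\le \sup_{\theta \in \Theta} \|N(\Gamma^{1/2}(\theta)\theta, n^{-1}I_k) - N(\Gamma^{1/2}(\theta)\theta, (n + r_n)^{-1}I_k)\|$
        $\le \Delta(\mathcal{G}^n, \mathcal{G}^{n+r_n}).$

(The infimum has to be taken over all kernels from $(R^k, \mathcal{B}(R^k))$ to $(R^k, \mathcal{B}(R^k))$.)
To prove the other direction we use the fact that asymptotically it suffices to
calculate the deficiencies for local subexperiments of $\mathcal{G}^n$ and $\mathcal{G}^{n+r_n}$: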


A(Z, grt") 
(3.14) = ae infsup {|| KN(0, nI) 7 N(9,(n + ee |: 
` TER 


l8 — ril sn + off) 


for $\alpha > 0$. (3.14) can be deduced from the existence of an estimator whose
probability of being outside an $n^{\alpha - 1/2}$-neighbourhood of the true parameter
decreases exponentially in $n$. For a detailed proof see Theorem 1 in Le Cam (1975).
Using (3.14) one gets


ACF", G+) = inf sup {| KN(9, n—I,) 


-N(0,(n F ra) Ly) eno < faie) + o(1) 


(3.15) ~N(6,(n + r,) 'T'-0)) | r"2(0)8|| < net) + o(1) 
= inf sup {|| KN(9, n-'T-*(8)) 
-N(6,(n + 7,) T0) |1200] < n112} + of) 
< A(G", grtn). 




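PROOF OF THEOREM 3. Without loss of generality we assume $r_n = 1$. Using
Lemma 2 and the triangle inequality one gets for a constant $a > 1$

(3.16)  $\Delta(\mathcal{E}^n, \mathcal{E}^{[an]}) \le \Delta(\mathcal{E}^n, \mathcal{E}^{n+1}) + \cdots + \Delta(\mathcal{E}^{[an]-1}, \mathcal{E}^{[an]})$
        $\le ([an] - n) \Delta(\mathcal{E}^n, \mathcal{E}^{n+1}).$
This gives
(3.17)  $n\Delta(\mathcal{E}^n, \mathcal{E}^{n+1}) \ge (a - 1)^{-1} \Delta(\mathcal{E}^n, \mathcal{E}^{[an]}).$
According to the assumptions there exists a sequence $(d_n)_{n \ge 1}$ in $R^+$ such that
(3.18)  $d_n \to \infty$ for $n \to \infty$ and $d_n/\sqrt{n}$ is monotone decreasing,
(3.19)  $\Delta(\mathcal{E}^n_{n,d_n}, \mathcal{G}^n_{n,d_n}) \to 0$ as $n \to \infty$.

Setting $m := [an]$ and $c_n = \sqrt{n/m} \, d_m$, one gets from (3.19):
(3.20)  $\Delta(\mathcal{E}^m_{n,c_n}, \mathcal{G}^m_{n,c_n}) = \Delta(\mathcal{E}^m_{m,d_m}, \mathcal{G}^m_{m,d_m}) \to 0$ for $n \to \infty$.

Further $c_n/\sqrt{n} = d_m/\sqrt{m} \le d_n/\sqrt{n}$ implies
(3.21)  $\Delta(\mathcal{E}^n_{n,c_n}, \mathcal{G}^n_{n,c_n}) \to 0$ for $n \to \infty$.
With $\Theta_n = \{\theta \in \Theta: \|\theta - \theta_0\| \le c_n/\sqrt{n}\}$ (3.20) and (3.21) entail

(3.22)  $\Delta(\mathcal{E}^n, \mathcal{E}^m) \ge \Delta(\mathcal{E}^n_{n,c_n}, \mathcal{E}^m_{n,c_n})$
        $= \Delta(\mathcal{G}^n_{n,c_n}, \mathcal{G}^m_{n,c_n}) + o(1)$
        $= \Delta((N(\theta, \Sigma)^n: \theta \in \Theta_n), (N(\theta, \Sigma)^m: \theta \in \Theta_n)) + o(1)$
        $= \Delta((N(\theta, \Sigma): \|\theta\| \le c_n), (N(\theta, (n/m)\Sigma): \|\theta\| \le c_n)) + o(1)$
        $= \Delta((N(\theta, \Sigma): \|\theta\| \le c_n), (N(\theta, a^{-1}\Sigma): \|\theta\| \le c_n)) + o(1).$
Using the Bayes criterion (1.3) this gives
        $= \Delta((N(\theta, \Sigma): \theta \in R^k), (N(\theta, a^{-1}\Sigma): \theta \in R^k)) + o(1).$
According to Torgersen (1972) [see also (1.6)] this is asymptotically equal to
        $= \|N(0, \Sigma) - N(0, a^{-1}\Sigma)\| + o(1)$
        $= \|N(0, I_k) - N(0, aI_k)\| + o(1).$
Putting (3.17) and (3.22) together one gets
(3.23)  $n\Delta(\mathcal{E}^n, \mathcal{E}^{n+1}) \ge (a - 1)^{-1} \|N(0, I_k) - N(0, aI_k)\| + o(1).$

For $a \downarrow 1$ the last term converges to $\lim_{n \to \infty} n\Delta(\mathcal{G}^n, \mathcal{G}^{n+1})$ as can be seen by
using (1.6). This completes the proof.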


PROOF OF THEOREM 4. According to (1.7), (1.8) it suffices to prove:
(3.24)  $\lim_{n \to \infty} n\Delta(\mathcal{E}_B^n, \mathcal{E}_B^{n+1}) \ge 2\sqrt{2/(\pi e)}.$

Put $\theta = \ln(p/(1 - p))$, $h = \sqrt{n}\theta$, $c(\theta) = \ln(2(1 - p))$. The Bernoulli distribution
has the following density with respect to the uniform distribution on $\{0, 1\}$:

$\exp(\theta x + c(\theta)).$




One can write 


$c(\theta) - c(0) = -\frac{\theta}{2} - \frac{\theta^2}{8} + A(\theta),$
where $A(h/\sqrt{n}) = c'''(\theta')(h/\sqrt{n})^3/6$ for some $\theta'$ between 0 and $\theta$. For proving
his asymptotic lower bound (1.8) Helgeland (1982) considers the following Bayes 
decision problem: Let 0 < a < t and c, = n°. 

Given a prior distribution having density with respect to Lebesgue measure

(3.25)  $r \exp\{-nA(h/\sqrt{n}) - h^2/(2\kappa^2)\} 1_{[-c_n, c_n]}(h),$

construct a confidence interval of length $2l = 2(1/4 + 1/\kappa^2)^{-1/2}$. The loss
function is $-1$ or $1$ according to whether the true parameter falls into the confidence
interval or not.

We modify this decision problem slightly. Consider the prior measure À, 
having density (3.25) with respect to the measure: 


vA) = #(AN (a,Z + b,)), 
nt+1 1 | 1 


dn’ 


s 





ona renee + PRR 
4n K’ 
b, = —a,(n + 1)/2. 


Further suppose that the confidence intervals to be constructed are closed with 
length 2l, = [2//a, Ja,,. 

Arguing as in Helgeland (1982) one can show that for m =n and m=n+1 
the a posteriori distribution function H,,,(t|X™) in @™ fulfills the following 
approximation ( X” = (X,,..., Xm) is the vector of the first m observations): 


(3.26) Epa, (sup [Emn(41X") - Hnq(t1X™)|) = o(1/n), 


where 


Fanti") = f ==], nde) amns 


— OO “mn mn 

; m 1 als 
E Sr 

mn 7 Ba ; 


Lmn = O ar r ( a a 
=. x, 
i=} 


+o l X— By, i 
o Í (==) (a) . 
Sieg Orns o 


mn 





The main idea of the proof is that the constants a,, b, are chosen so that u,,,,, 
is an element of the support of the prior measure A, (and r„) and the point tpn 
lies close to the midpoint of two neighbouring points of the support of A,, (see 
(3.32)). (3.26) can be used to construct confidence intervals C,,C,,, having 
asymptotically minimal Bayes risk. For m = n + 1 choose the interval p,.,, £ ln 
and for m = n choose Z,,, + 1,, where ñ, is the point in a,Z + b, closest to 

an: According to (3.26) the corresponding Bayes risks differ from the minimal 




Bayes risk by o(1/n). Using (1.3) one gets an asymptotic lower bound for the 
deficiency: 








(3.27) A(€3, EB) = Pan Patin + O(1/n), 
where for m = n, m = n + 1, resp. 
1 X Bin 
Pmn = Epmy, 1—- 2f oA"), (de) ann 
Ca Imn mn 


Put A, = (x" Ersen (@— 1/2] < n*'7}, Then EX,- }= O(n*-¥?) 
uniformly for 0 in the support of A,, and therefore 

(3.28) P” (AS) = o(1/n). 

Uniformly for X”+! e A, X {0,1} one gets Opin ~ Onn = OCN), Basin 7 
Enn = O(n™'/*), This implies 











(3.29) ain/ahrin = 1+ O(n */*), 
(3.30) a’ /a,=1+ O(n"). 
Using (3.28)~—(3.30) one gets 
(3.31) Pnn Pa+in ~ Lin + Ia, + o(1/n), 
where 
L, 1 x 1 x z 
L,=2f -=Z + of |} ra, 
by Onn Onn On+in On+in 
i. l x X= by, t Ban ve 
fan p Epaf" nn el 7 | Onn pCa). 


y,, is the following normalised counting measure: 
P(A) = a,#(ANa,Z). 
As in Helgeland (1982) one shows that 


3.32 I 1+ oe: : + : 
(3.32) T 3| nV ne o| ); 


Further, for n large enough, the following holds in A, with = 1 or = —1: 


n n 
Unn ~ ban ie Opn n 1? 2 (X, 7 1) ai Gian e » (x, E 1) z Eoen A2 


t=] 1m] 
= —fa,/2 + O(n?*"!). 
This can be used to evaluate I,,. Put ô, = (finn — Bnn)/Onn and J, = 
(—1,/ Onn: ln/ mn) and v/(*) = b((+)o,,). Then 


Izn = 2E pa, f (2) — o(x + 6,) (dx) 
(3.33) = Ba, f o(#){#8, + (1 ~ =7)83/2} 


—$(x + 6,(x)){3(x + 8,(x)) 
—(x + 8,(x))"}83/6 »;(dx) + o(1/n), 




where @, is the restriction of P”A, on A, and 6,(-) is a function with 
[5,,¢-)| Ss |ôņl- 
Evaluating the integrals in (3.33) one gets 


In = Egb? f $(x)(1 — x?) de + o(1/n) 
dp 


(3.34) a?/40,,?[ yo(y)]_, + o(1/n) 


4\"'1 /2 
+5 m E + o(1/n). 


Since $\kappa^2$ can be chosen arbitrarily large, (3.24) follows from (3.27), (3.31), (3.32),
and (3.34).


Acknowledgment. The author wishes to thank the referees for many help- 
ful suggestions. 


REFERENCES 


BIRGÉ, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Z. Wahrsch.
verw. Gebiete 65 181-237.

DACUNHA-CASTELLE, D. (1978). Vitesse de convergence pour certains problèmes statistiques. In
Ecole d'Eté de Probabilités de Saint-Flour VII-1977. Springer, Berlin.

HELGELAND, J. (1982). Additional observations and statistical information in the case of 1-parameter
exponential distributions. Z. Wahrsch. verw. Gebiete 59 77-100.

LE CAM, L. (1964). Sufficiency and approximate sufficiency. Ann. Math. Statist. 35 1419-1455.

LE CAM, L. (1968). Théorie asymptotique de la décision statistique. Les Presses de l'Université de
Montréal.

LE CAM, L. (1974). On the information contained in additional observations. Ann. Statist. 2 630-649.

LE CAM, L. (1975). On local and global properties in the theory of asymptotic normality of
experiments. In Stochastic Processes and Related Topics (M. L. Puri, ed.). Academic,
London.

LE CAM, L. (1985). Sur l'approximation de familles de mesures par des familles Gaussiennes. Ann.
Inst. Henri Poincaré 21 225-287.

MAMMEN, E. (1983). Die statistische Information zusätzlicher Beobachtungen. Inaugural-Dissertation,
Heidelberg.

MÜLLER, D. W. (1980/81). The increase of risk due to inaccurate models. Symposia Mathematica
25 73-84.

SWENSEN, A. R. (1980). Deficiencies in linear normal experiments. Ann. Statist. 8 1142-1155.

TORGERSEN, E. N. (1972). Comparison of translation experiments. Ann. Math. Statist. 43 1383-1399.

TORGERSEN, E. N. (1981). Measures of information based on comparison with total information and
with total ignorance. Ann. Statist. 9 638-657.


UNIVERSITAT HEIDELBERG 

IM NEUENHEIMER FELD 294 

6900 HEIDELBERG 1 

FEDERAL REPUBLIC OF GERMANY 


The Annals of Statistics 
1986, Vol 14, No 2, 679-600 


AN APPROACH TO UPPER BOUND PROBLEMS FOR RISKS OF 
GENERALIZED LEAST SQUARES ESTIMATORS 


By YASUYUKI TOYOOKA AND TAKEAKI KARIYA 


Osaka University and Hitotsubashi University 


First, an approach to an upper bound for the risk matrix of GLSE's is
established when the information on the parameter space of the structural
parameter in the covariance matrix of the error can be utilized. Second, this
result is applied to regression with (i) serial correlation and (ii) heteroscedastic
covariance structure. In the heteroscedastic regression, the problem of
estimating the common mean of two normal populations is studied in detail.


1. Introduction. As is well known, in the regression model

(1.1)  $y = X\beta + u$,  $E(u) = 0$  and  $\mathrm{Cov}(u) = \sigma^2\Sigma$,
the Gauss-Markov estimator
(1.2)  $b(\Sigma) = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$

is the best linear unbiased estimator of $\beta$, provided $\Sigma$ is known. Here $X$ is an
$n \times k$ fixed matrix of rank $k$ and $\Sigma$ is a positive definite matrix. Often, however,
$\Sigma$ is a function of an estimable parameter, say $\Sigma = \Sigma(\theta)$, so that $\Sigma$ can and must
be estimated based on $y$. In such a model, $\Sigma$ in (1.2) is replaced by an estimator,
say $\hat\Sigma = \Sigma(\hat\theta)$, and estimators of the form $b(\hat\Sigma)$ are often used in practice [see
Theil (1971), Chapter 6]. In this article an estimator of this form shall be called a
generalized least squares estimator (GLSE).

In many applications, the estimator for $\theta$ is based on the ordinary least
squares (OLS) residual,

(1.3)  $e = Ny = Nu$  with  $N = I - X(X'X)^{-1}X'$.
In Kariya and Toyooka (1985), when $\hat\Sigma = \Sigma(\hat\theta(e))$ and when the density function
of $u$ belongs to a class of spherical density functions with mean 0 and covariance
$\sigma^2\Sigma$,

$f(u) = |\sigma^2\Sigma|^{-1/2} q(u'(\sigma^2\Sigma)^{-1}u)$

(where the class is denoted as $S_n(0, \sigma^2\Sigma)$ below), it was shown that the risk
matrix of $b(\hat\Sigma)$ is bounded below by the covariance matrix of $b(\Sigma)$:

(1.4)  $R(b(\hat\Sigma)) = E[b(\hat\Sigma) - \beta][b(\hat\Sigma) - \beta]' \ge \mathrm{Cov}(b(\Sigma)) = \sigma^2(X'\Sigma^{-1}X)^{-1}$,

where throughout this article, inequalities for matrices should be understood in
terms of nonnegative definiteness. Moreover, it is shown that (1.4) is valid for the

Received December 1983; revised March 1985. 

AMS 1980 subject classifications. Primary 62J05; secondary 62M10. 

Key words and phrases. GLSE, heteroscedasticity, intraclass correlation, serial correlation, upper 
bound for risk. 




class which contains $b(\hat\Sigma)$,
$\{\hat\beta \mid \hat\beta = C(e)y$, $C(e)$: $k \times n$ measurable function of $e$ such
that $C(e)X = I$ and $E\|\hat\beta\|^2 < \infty\}$,
where $\|a\|^2 = a'a$ for $a \in R^k$. On the other hand, the uniform bounds for the
approximation to the p.d.f. and the c.d.f. of GLSE were given in Toyooka and
Kariya (1983).

In this paper we consider the problem of evaluating an upper bound for the risk
matrix of a GLSE in (1.4) under normality of $u$. Kariya (1981) also derived upper
bounds for the risk of some GLSE’s in Zellner’s SUR model and a heteroscedastic 
model. Our approach here is different, more systematic, and thus applicable to 
many regression models with complicated covariance structure. As discussed in 
Remark 2.2 and Remark 2.3, our resulting upper bound uses the fourth moment 
of the estimator for the structural parameter to preserve the magnitude of the 
order. On the other hand, the upper bound using the second moment does not 
preserve the magnitude of the order which is discussed in Remark 2.2. In Section 
2 a general approach to the upper-bound problem is established and in Section 3 
it is applied to regressions with (i) serial correlation and (ii) heteroscedastic
covariance structure. In the heteroscedastic regression, we treat as a special case 
the problem of estimating the common mean of two normal populations and 
compare the upper bound with the exact variance [see, e.g., Khatri and Shah 
(1975) and Cohen and Sackrowitz (1974) for the problem]. 


2. Upper bound for the risk matrix of GLSE. Let
(2.1)  $y = X\beta + u$,  $E(u) = 0$,  and  $\mathrm{Cov}(u) = \sigma^2\Sigma(\theta)$,
where $X$ is a fixed $n \times k$ matrix of rank$(X) = k$ and $\theta \in \Theta$ (nonempty
open) $\subset R^1$. Assume that $\Sigma(\theta)$ is nonsingular in $\Theta$ such that
(2.2)  $\Sigma^{-1}(\theta) = I_n + \lambda_n(\theta)C$,
where $\lambda_n(\theta)$ is a continuous function of $\theta$ defined on $\Theta$ into $R^1$. Let $B$ be an
orthogonal matrix such that
(2.3)  $B'CB = \mathrm{diag}\{d_1, d_2, \ldots, d_n\} \equiv D$  with  $d_1 \le \cdots \le d_n$.
Using $B$ and rewriting (2.1) as $B'y = B'X\beta + B'u$, without loss of generality, we
can assume
(2.4)  $\Sigma^{-1}(\theta) = I_n + \lambda_n(\theta)D$.

And we further assume that $d_i \ge 0$ for all $i$ and $d_n > 0$. Typical examples in
which (2.4) is satisfied are the covariance structure of serial correlation, intraclass
correlation, and heteroscedasticity. The parameter $\theta$ is to be estimated based on
the OLS residual $e$ in (1.3), which is often possible. For $\lambda = \lambda_n(\theta) = \lambda(\theta)$ in (2.2)
we shall consider an estimator of the form $\hat\lambda = \lambda(\hat\theta)$, where we assume $\hat\theta \in \Theta$ so
that $\hat\lambda \in \Lambda = \{\lambda \mid \lambda = \lambda(\theta), \theta \in \Theta\}$.

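To state a main result in this section, we introduce some notation. Let $Z$ be a
matrix satisfying

(2.5)  $ZZ' = N$,  $Z'Z = I_{n-k}$,  and  $Z'X = 0$

and fix it throughout the article. Let

(2.6)  $A = (X'\Sigma^{-1}X)^{-1}$,  $\bar X = \Sigma^{-1/2}XA^{1/2}$,
       $\bar Z = \Sigma^{1/2}Z(Z'\Sigma Z)^{-1/2}$  and  $\Gamma = [\bar X, \bar Z]$.

Then $\Gamma$ is an $n \times n$ orthogonal matrix. Define

(2.7)  $\tilde u = \Gamma'\bar u = (\tilde u_1', \tilde u_2')' = ((\bar X'\bar u)', (\bar Z'\bar u)')'$,

where $\bar u = \Sigma^{-1/2}u$. Now from (1.2) with $\hat\Sigma^{-1} = I + \hat\lambda D$, the GLSE $b(\hat\Sigma)$ is
expressed as

(2.8)  $b(\hat\Sigma) - \beta = (X'\hat\Sigma^{-1}X)^{-1}X'\hat\Sigma^{-1}u$
       $= A^{1/2}(\bar X'\hat S^{-1}\bar X)^{-1}\bar X'\hat S^{-1}\bar u$
       $= A^{1/2}\tilde u_1 + A^{1/2}(\bar X'\hat S^{-1}\bar X)^{-1}\bar X'\hat S^{-1}\bar Z\tilde u_2$
       $=$ (I) $+$ (II) (say),
where
(2.9)  $\hat S = \Sigma^{-1/2}\Sigma(\hat\theta)\Sigma^{-1/2}$,  $\hat\theta = \hat\theta(e)$,  and  $e = Z(Z'\Sigma Z)^{1/2}\tilde u_2$.
Note that the second term (II) is a function of $\tilde u_2$ only.


LEMMA 2.1 (Kariya and Toyooka, 1985). Let $\tilde u \in S_n(0, \sigma^2 I)$. If the second
moments of $b(\hat\Sigma)$ exist, then

(2.10)  $R = E[b(\hat\Sigma) - \beta][b(\hat\Sigma) - \beta]' = \sigma^2A + A^{1/2}E[\Delta\Delta']A^{1/2}$
        $= \mathrm{Cov}(b(\Sigma)) + E[b(\hat\Sigma) - b(\Sigma)][b(\hat\Sigma) - b(\Sigma)]'$,
where $\Delta = (\bar X'\hat S^{-1}\bar X)^{-1}\bar X'\hat S^{-1}\bar Z\tilde u_2$.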


First it is remarked that this result is not restricted to the case $\Sigma(\theta)^{-1} = I +
\lambda(\theta)D$ but it holds for any form of $\hat\Sigma$. Second, a sufficient condition for the
existence of the second moments of $b(\hat\Sigma) - \beta$ is that $\hat\theta = \hat\theta(e)$ is continuous and
scale invariant, i.e., $\hat\theta(e) = \hat\theta(ae)$ for $a \in R^1$ [see Kariya and Toyooka (1985)].
Third, in the decomposition of the risk matrix in (2.10), the first term is the risk
of the Gauss-Markov estimator and the second term is the loss of efficiency due
to the estimation of $\theta$.

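The evaluation of the loss $A^{1/2}E[\Delta\Delta']A^{1/2}$ is our concern. To do so, let

(2.11)  $B_1 = \{\hat\lambda - \lambda \ge 0\}$,  $B_2 = \{\hat\lambda - \lambda < 0\}$,
        $W_1 = 1$,  $W_2 = (1 + \lambda d_n)^2/(1 + \hat\lambda d_n)^2$,
        $F = \mathrm{diag}\{d_1/(1 + \lambda d_1), \ldots, d_n/(1 + \lambda d_n)\}$  and  $L = \bar X'F\bar Z$.


THEOREM 2.1. Assume that $\Sigma^{-1}(\theta)$ is of the covariance structure (2.4) and
$\hat\theta \in \Theta$ (a.e.). Then

(2.12)  $A^{1/2}E[\Delta\Delta']A^{1/2} \le (g_1 + g_2)A$,
where $g_i = E[\chi_{B_i}(\hat\lambda - \lambda)^2 W_i \tilde u_2'L'L\tilde u_2]$ $(i = 1, 2)$.

PROOF. Since $\hat S = \Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}$,
(2.13)  $\Delta = (\bar X'H\bar X)^{-1}\bar X'H\bar Z\tilde u_2$,  with  $H = \hat S^{-1} = \Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2}$.
Then

$H = \mathrm{diag}\left\{\frac{1 + \hat\lambda d_1}{1 + \lambda d_1}, \ldots, \frac{1 + \hat\lambda d_n}{1 + \lambda d_n}\right\} = I_n + (\hat\lambda - \lambda)F.$

From (2.13) and $\bar X'\bar Z = 0$,
$\Delta = \{I_k + (\hat\lambda - \lambda)\bar X'F\bar X\}^{-1}(\hat\lambda - \lambda)\bar X'F\bar Z\tilde u_2.$
Let $J = I_k + (\hat\lambda - \lambda)\bar X'F\bar X$. Then
(2.14)  $\Delta = (\hat\lambda - \lambda)J^{-1}L\tilde u_2.$
Now, for any $a \in R^k$, by Schwarz's inequality,

(2.15)  $a'E[\Delta\Delta']a = E[(\hat\lambda - \lambda)a'J^{-1}L\tilde u_2]^2$
        $\le E[(\hat\lambda - \lambda)^2 a'J^{-2}a \, \tilde u_2'L'L\tilde u_2].$

Since $J$ depends on the sign of $(\hat\lambda - \lambda)$, we evaluate it on each $B_i$ in (2.11). On
$B_1 = \{\hat\lambda - \lambda \ge 0\}$

$J = I_k + (\hat\lambda - \lambda)\bar X'F\bar X \ge I_k$
since $\bar X'F\bar X \ge 0$. Then with $W_1 = 1$ in (2.11)

(2.16)  $a'J^{-2}a \le a'a = W_1a'a.$

Next, let

(2.17)  $w = \max_{1 \le i \le n}\{d_i/(1 + \lambda d_i)\} = d_n/(1 + \lambda d_n)$,

since $f(x) = x/(1 + \lambda x)$ $(x \ge 0)$ is increasing in $x$ for any $\lambda$. Then on $B_2$,
$J \ge I_k + (\hat\lambda - \lambda)wI_k = \{1 + (\hat\lambda - \lambda)w\}I_k$,
whence with $W_2 = (1 + \lambda d_n)^2/(1 + \hat\lambda d_n)^2$ in (2.11),

(2.18)  $a'J^{-2}a \le \{1 + (\hat\lambda - \lambda)w\}^{-2}a'a = W_2a'a.$

Therefore from (2.16) and (2.18),

(2.19)  $a'E[\Delta\Delta']a \le \sum_{i=1}^{2} E[\chi_{B_i}(\hat\lambda - \lambda)^2 W_i \tilde u_2'L'L\tilde u_2] \, a'a$
        $= (g_1 + g_2)a'a.$

Thus, the desired result is obtained. □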

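REMARK 2.1. In the original term, $\tilde u_2'L'L\tilde u_2$ is expressed as

$\tilde u_2'L'L\tilde u_2 = y'[I - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-1}X'] \, G \, [I - X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}]y$,
where $G = \Sigma^{-1/2}F\Sigma^{-1/2}X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1/2}F\Sigma^{-1/2}$.


To evaluate the upper bound in (2.12) further, we assume that
$\tilde u \sim N(0, \sigma^2I_n)$.


LEMMA 2.2. When $v \sim N(0, \sigma^2I_m)$ and $C$ is an $m \times m$ matrix,
(2.20)  $E[(v'Cv)^2] = \sigma^4[(\mathrm{tr}\,C)^2 + \mathrm{tr}\,C'C + \mathrm{tr}\,C^2].$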
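The moment formula of Lemma 2.2 can be evaluated directly. The Python sketch below is an illustration only: it computes the right-hand side of (2.20) with plain list-based matrices and, for diagonal $C$, compares it with a direct computation based on $E v_i^4 = 3\sigma^4$ and $E v_i^2 v_j^2 = \sigma^4$ for $i \ne j$. The helper names are hypothetical.

```python
def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def trace(a):
    """Trace of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


def quad_form_second_moment(c, sigma2):
    """Right-hand side of (2.20): sigma^4 [(tr C)^2 + tr C'C + tr C^2]."""
    return sigma2 ** 2 * (trace(c) ** 2
                          + trace(matmul(transpose(c), c))
                          + trace(matmul(c, c)))


def diagonal_case_direct(diag, sigma2):
    """E[(sum_i c_i v_i^2)^2] for independent v_i ~ N(0, sigma2), computed directly.

    sum_i c_i^2 * 3 sigma^4 + sum_{i != j} c_i c_j sigma^4
      = sigma^4 * ((sum_i c_i)^2 + 2 * sum_i c_i^2).
    """
    s1 = sum(diag)
    s2 = sum(d * d for d in diag)
    return sigma2 ** 2 * (s1 * s1 + 2.0 * s2)


if __name__ == "__main__":
    c = [[1.0, 0.0], [0.0, 2.0]]
    print(quad_form_second_moment(c, 1.0), diagonal_case_direct([1.0, 2.0], 1.0))
```

For symmetric $C$, as with $C = L'L$ in the text, the last two traces coincide and the bracket reduces to $(\mathrm{tr}\,C)^2 + 2\,\mathrm{tr}\,C^2$.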


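Using this lemma with $C = L'L$ and applying Schwarz's inequality to $g_i$ in
(2.12) yield

(2.21)  $g_i \le \sigma^2[E\{\chi_{B_i}W_i^2(\hat\lambda - \lambda)^4\}]^{1/2}[(\mathrm{tr}\,L'L)^2 + 2\,\mathrm{tr}(L'L)^2]^{1/2}$  $(i = 1, 2)$.
Now, combining (2.21) with (2.12) and (2.10), we obtain


THEOREM 2.2. For $\Sigma^{-1}$ in (2.4),
(2.22)  $\sigma^2A \le R = R(b(\hat\Sigma)) \le \sigma^2A + \delta(\bar g_1 + \bar g_2)\sigma^2A$,
where $\delta = [(\mathrm{tr}\,LL')^2 + 2\,\mathrm{tr}(LL')^2]^{1/2}$ and $\bar g_i = [E\{\chi_{B_i}W_i^2(\hat\lambda - \lambda)^4\}]^{1/2}$ $(i = 1, 2)$.


If $\hat\theta(e)$ is an even function of $e$, $b(\hat\Sigma)$ is unbiased for $\beta$ in which case
$R(b(\hat\Sigma)) = \mathrm{Cov}(b(\hat\Sigma))$.

A difficulty here is the evaluation of $\bar g_1$ and $\bar g_2$, which may be replaced by
$[E\{(\hat\lambda - \lambda)^4\}]^{1/2}$ and $[E\{W_2^2(\hat\lambda - \lambda)^4\}]^{1/2}$, respectively, or may be both replaced
by $[E\{(\hat\lambda - \lambda)^4\}]^{1/2}$.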


REMARK 2.2. As is discussed in Section 3, $\hat\theta = \hat\theta(e)$ with $e = Z(Z'\Sigma Z)^{1/2}\tilde u_2$
in (2.9) is often assumed to satisfy $\hat\theta(ae) = \hat\theta(e)$ for $a \in R^1$. So we assume
$\hat\theta = \hat\theta(\tilde u_2/\|\tilde u_2\|)$. Therefore, $\hat\lambda = \hat\lambda(\tilde u_2/\|\tilde u_2\|)$. In this case, another evaluation for
(2.15) is possible via Schwarz's inequality, i.e., for any $a \in R^k$,

$a'E[\Delta\Delta']a \le (n - k)\sum_{i=1}^{2} E[(\hat\lambda - \lambda)^2W_i] \, a'a \, \delta_1$,

where $\delta_1 = \mathrm{tr}\,L'L$. However, the r.h.s. of the above inequality is generally $O(1)$ as
$n \to \infty$ since $E[(\hat\lambda - \lambda)^2W_i] = O(1/n)$. Therefore, this upper bound does not
preserve the magnitude of the order as compared to the upper bound obtained in
our theorem. See also Remark 2.3.


REMARK 2.3. The inequality (2.22) holds for any regressor $X$ and any
estimator $\hat\lambda = \lambda(\hat\theta)$. Further, the second term of the r.h.s. of (2.22) is of higher
order in general. To see this, assume that $A = (X'\Sigma^{-1}X)^{-1} = O(1/n)$
or $\lim X'\Sigma^{-1}X/n > 0$ exists as usual, and $\hat\theta - \theta = O_p(1/\sqrt{n})$. Then $\bar g_1 \le
[E\{(\hat\lambda - \lambda)^4\}]^{1/2} = O(1/n)$ and from (2.18) $\bar g_2 \le [E\{W_2^2(\hat\lambda - \lambda)^4\}]^{1/2} = O(1/n)$
at least under such a condition as the boundedness of $W_2$. Hence, since $\delta = O(1)$
as is easily seen from the definitions of $F$ and $\delta$, $\delta(\bar g_1 + \bar g_2)\sigma^2A = O(1/n^2)$,
while $\sigma^2A = O(1/n)$. Therefore, our upper bound (2.22) preserves the order
structure, contrary to the case in Remark 2.2.


REMARK 2.4. It is interesting to see that if $\bar X$ is of the form $P\binom{I_k}{0}Q$ with
permutation matrices $P$ and $Q$, $\delta = 0$, because

$LL' = \bar X'F\bar Z\bar Z'F\bar X = \bar X'F[I - \bar X\bar X']F\bar X = \bar X'F^2\bar X - (\bar X'F\bar X)^2$

implies $\mathrm{tr}\,LL' = 0$ and $\mathrm{tr}(LL')^2 \le (\mathrm{tr}\,LL')^2 = 0$. From Theorem 2.2 this implies
that the GLSE is as efficient as the Gauss-Markov estimator. In fact, the
following proposition holds in general.


PROPOSITION 2.1. $\delta = 0$ if and only if $X'\Sigma Z = 0$.


PROOF. $\delta = 0$ implies $\mathrm{tr}\,LL' = 0$, which in turn implies $L = \bar X'F\bar Z = 0$. Hence,
from the definition of $F$, $\bar X$, and $\bar Z$, $\bar X'(I - \lambda F)\bar Z = 0$ or $\bar X'\Sigma\bar Z = 0$ (i.e., $X'\Sigma Z = 0$).
Tracing back this proof yields the converse, completing the proof. □


The condition $X'\Sigma Z = 0$ in Proposition 2.1 is well known as a necessary and
sufficient condition for the OLSE $b(I)$ to be identically equal to the
Gauss-Markov estimator $b(\Sigma)$ [see, e.g., Rao (1967)]. Hence, under this
condition, $b(I) = b(\hat\Sigma) = b(\Sigma)$ and so $\mathrm{Cov}(b(\hat\Sigma)) = \mathrm{Cov}(b(\Sigma))$. Theorem 2.2
together with Proposition 2.1 provides an alternative proof of the above result by
Rao.


3. Applications. In this section the results obtained in Section 2 are applied 
to two cases, regressions with (i) serial correlation and (ii) heteroscedastic covari- 
ance structure. 


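EXAMPLE 1 (Regression with serial correlation). We consider the model (2.1)
with errors of first-order serial correlation:

(3.1)  $u_t = \theta u_{t-1} + \varepsilon_t$,  $\theta \in \Theta = \{\theta: |\theta| < 1\}$,

where $u = (u_1, \ldots, u_n)'$ and $\{\varepsilon_t\} \sim$ i.i.d. $N(0, \sigma_\varepsilon^2)$. As is well known, the inverse
matrix of the covariance matrix $V$ of $u$ is given by

(3.2)  $V^{-1} = \frac{1}{\sigma_\varepsilon^2}\begin{pmatrix} 1 & -\theta & & & 0 \\ -\theta & 1+\theta^2 & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & 1+\theta^2 & -\theta \\ 0 & & & -\theta & 1 \end{pmatrix}.$

This matrix is often approximated by the matrix $W^{-1}$ in which the $(1, 1)$th and
$(n, n)$th elements of $V^{-1}$ are replaced by $1 - \theta + \theta^2$. Then $W^{-1} = [(1 - \theta)^2I_n +
\theta C]/\sigma_\varepsilon^2$, where

(3.3)  $C = \begin{pmatrix} 1 & -1 & & & 0 \\ -1 & 2 & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & 2 & -1 \\ 0 & & & -1 & 1 \end{pmatrix}.$

Using the scale invariance of GLSE $b(\hat\Sigma)$, regard $\Sigma$ in (2.1) as the inverse of
(3.4)  $\Sigma^{-1} = I_n + \lambda(\theta)C$  with  $\lambda = \theta/(1 - \theta)^2$,

where $\sigma^2 = \sigma_\varepsilon^2/(1 - \theta)^2$. Note $-1/4 < \lambda < \infty$ from $|\theta| < 1$. As in Durbin and
Watson (1971), the eigenvalues of $C$ are given by
(3.5)  $d_j = 2[1 - \cos(\pi(j - 1)/n)]$  $(j = 1, \ldots, n)$.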
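The eigenvalue formula (3.5) can be verified numerically. The following Python sketch builds the tridiagonal matrix $C$ of (3.3) and checks that the vectors with entries $\cos(\pi(j-1)(t - 1/2)/n)$, $t = 1, \ldots, n$, are eigenvectors for $d_j$. The eigenvector form is a standard fact about this matrix that is not stated in the text, and the function names are illustrative.

```python
import math


def c_matrix(n):
    """Tridiagonal matrix C of (3.3): 2 on the diagonal (1 in the two corners),
    -1 on the first off-diagonals. Requires n >= 2."""
    c = [[0.0] * n for _ in range(n)]
    for t in range(n):
        c[t][t] = 2.0
        if t > 0:
            c[t][t - 1] = -1.0
        if t < n - 1:
            c[t][t + 1] = -1.0
    c[0][0] = 1.0
    c[n - 1][n - 1] = 1.0
    return c


def eigenvalue(j, n):
    """d_j of (3.5), j = 1, ..., n."""
    return 2.0 * (1.0 - math.cos(math.pi * (j - 1) / n))


def eigenvector(j, n):
    """Assumed eigenvector for d_j: entries cos(pi (j-1) (t - 1/2) / n)."""
    return [math.cos(math.pi * (j - 1) * (t - 0.5) / n) for t in range(1, n + 1)]


def eigen_residual(j, n):
    """max_t |(C v)_t - d_j v_t|, which should vanish up to rounding."""
    c, v, d = c_matrix(n), eigenvector(j, n), eigenvalue(j, n)
    cv = [sum(c[s][t] * v[t] for t in range(n)) for s in range(n)]
    return max(abs(cv[s] - d * v[s]) for s in range(n))


if __name__ == "__main__":
    n = 8
    print([round(eigenvalue(j, n), 4) for j in range(1, n + 1)])
    print(max(eigen_residual(j, n) for j in range(1, n + 1)))
```

The eigenvalues start at $d_1 = 0$, increase with $j$, stay below 4, and sum to $\mathrm{tr}\,C = 2n - 2$.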


It is noted that 0 < d, < max,.,.,d,=d, < 4 and d, + 0 except d, and that 


d,=0<d,< -= <d, So we can assume by (2.4) 

(3.6) x '=1,+A(0)D_ with D = diag(0, d,,..., dp). 

Next, choose an estimator 6 of 0 based on the OLS residual e such that 

(3.7) Hee <1, 6(-e)=6(e), 6(ae)=6(e) fora>Oand 
Ê is continuous. 

A typical choice will be

(3.8) $\hat\theta_0 = \sum_{t=2}^{n} e_t e_{t-1} \Big/ \sum_{t=1}^{n} e_t^2 = e'Ke/e'e,$

where $e = (e_1, e_2, \ldots, e_n)'$ and $K = (k_{ij})$ with $k_{ij} = 0$ except $k_{j,j-1} = k_{t,t+1} = 1/2$ ($j = 2, \ldots, n$; $t = 1, \ldots, n-1$). Note that an application of Schwarz's inequality shows $|\hat\theta_0| < 1$. The second and third conditions of (3.7) guarantee the unbiasedness and the existence of the second moment of the GLSE $\hat\beta(\hat\Sigma)$, respectively.
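
As an illustration of (3.7) and (3.8), the sketch below computes the lag-one estimator both as a ratio of sums and as the quadratic form $e'Ke/e'e$, and checks the sign and scale invariance on one residual vector. The residual values are hypothetical and the function names are not from the paper.

```python
def theta_hat0(e):
    """Lag-one serial correlation estimate (3.8) from OLS residuals e."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(x * x for x in e)
    return num / den

def theta_hat0_quadratic_form(e):
    """Same quantity as e'Ke / e'e, with K having 1/2 on the two off-diagonals."""
    n = len(e)
    k = [[0.5 if abs(s - t) == 1 else 0.0 for t in range(n)] for s in range(n)]
    eke = sum(e[s] * k[s][t] * e[t] for s in range(n) for t in range(n))
    return eke / sum(x * x for x in e)

# Hypothetical residuals, for illustration only.
e = [0.8, -0.3, 0.5, 1.1, -0.7]
flipped = [-x for x in e]
scaled = [3.0 * x for x in e]

print(abs(theta_hat0(e)) < 1.0)
print(abs(theta_hat0(flipped) - theta_hat0(e)) < 1e-12)
print(abs(theta_hat0(scaled) - theta_hat0(e)) < 1e-12)
```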


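Define

(3.9) $\hat\lambda = \hat\theta/(1-\hat\theta)^2$ and $\hat\Sigma = (I_n + \hat\lambda D)^{-1}$.


Now we derive an upper bound for $\mathrm{Cov}(\hat\beta(\hat\Sigma))$. Let $\bar\Theta = \{\theta : |\theta| < 1\}$. Since $\bar\Theta = \Theta$, the inequality (2.22) in Theorem 2.2 is valid.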


686 Y. TOYOOKA AND T. KARIYA 


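For practical use it is necessary to evaluate $\sigma^2 A + \delta(g_1 + g_2)\sigma^2 A$ in the inequality exactly. But here we obtain an approximation to this.

LEMMA 3.1. Assume $\hat\theta - \theta = O_p(1/\sqrt{n})$. Then
$W_i = 1 + O_p(1/\sqrt{n})$.

PROOF. For the evaluation of $W_i$, we consider the behaviour of

(3.10) $(1 + \lambda d_j)/(1 + \hat\lambda d_j)$.

Expanding $\hat\lambda = \lambda(\hat\theta)$ around $\theta$,

(3.11) $\lambda(\hat\theta) = \lambda(\theta) + (\hat\theta - \theta)\,\frac{d}{d\theta}\lambda(\theta^*) = \lambda(\theta) + O_p(1/\sqrt{n})$ for $\hat\theta < \theta^* < \theta$.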


Remark that by Taylor’s expansion, 
wel 
|as cos| T J 
n 


2r 
= 4+ pra + o(1/n). 


=2 








(3.12) 


Put (3.11) and (3.12) into (3.10), 
1+ Ad, a+ 2ad/n 
1+Ad, a+ 2md/n+0,(1/vn) 
=1+O(1/yn) witha=1+ 4A>0, 
which completes the proof. O 


On the other hand,

(3.13) $g_i \le [E\{W_i^2(\hat\lambda - \lambda)^4\}]^{1/2} \quad (i = 1, 2).$


Using Lemma 3.1, for the typical choice $\hat\theta_0$ in (3.8), the leading terms of the r.h.s. of (3.13) are evaluated by the usual $\delta$-method as follows.


LEMMA 3.2. For the typical choice $\hat\theta_0$, the leading terms of the r.h.s. of (3.13) are both evaluated as

$[E\{W_i^2(\hat\lambda - \lambda)^4\}]^{1/2} = \frac{1}{n}\,\frac{\sqrt{3}\,(1+\theta)}{(1-\theta)^7} + o(1/n) \quad (i = 1, 2).$


PROOF. Since $\frac{d}{d\theta}\lambda(\theta) = (1+\theta)/(1-\theta)^3$, $\lambda(\hat\theta_0) - \lambda(\theta)$ is approximately $(\hat\theta_0 - \theta)(1+\theta)/(1-\theta)^3$. By Lemma 3.1, $W_i^2(\hat\lambda - \lambda)^4$ is asymptotically $(\hat\theta_0 - \theta)^4(1+\theta)^4/(1-\theta)^{12}$. Remark that $\sqrt{n}(\hat\theta_0 - \theta)$ converges to $N(0, 1/(1-\theta^2))$ in distribution. Then by using the fourth-order central moment of a normal random variable, the result is obtained. □
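
The derivative used in the $\delta$-method step can be sanity-checked numerically. This is a small sketch under the definition $\lambda(\theta) = \theta/(1-\theta)^2$ from (3.4); the central-difference helper and its step size are illustrative choices, not part of the paper.

```python
def lam(theta):
    """lambda(theta) = theta / (1 - theta)^2, as in (3.4)."""
    return theta / (1.0 - theta) ** 2

def dlam(theta):
    """Closed-form derivative (1 + theta) / (1 - theta)^3."""
    return (1.0 + theta) / (1.0 - theta) ** 3

def central_difference(f, x, h=1e-6):
    """Symmetric finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for theta in (-0.9, -0.3, 0.0, 0.5, 0.8):
    approx = central_difference(lam, theta)
    print(theta, abs(approx - dlam(theta)) <= 1e-6 * max(1.0, abs(dlam(theta))))
```

Because the derivative is positive on $(-1, 1)$, $\lambda$ is increasing there, which is why $\lambda$ stays above its limit $-1/4$ at $\theta \to -1$.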


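Since

(3.14) $\sigma^2 A + \delta(g_1 + g_2)\sigma^2 A \le \sigma^2 A + \delta(h_1 + h_2)\sigma^2 A,$

where $h_i = [E\{W_i^2(\hat\lambda - \lambda)^4\}]^{1/2}$ $(i = 1, 2)$, the leading term of the r.h.s. of (3.14) is obtained as follows.


THEOREM 3.1. For the choice $\hat\theta_0$, an approximation to the r.h.s. of (3.14) is

$\sigma^2 A + 2\delta\,\frac{1}{n}\,\frac{\sqrt{3}\,(1+\theta)}{(1-\theta)^7}\,\sigma^2 A.$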


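EXAMPLE 2 (Regression with heteroscedastic covariances). The heteroscedastic covariance structure in (2.1) is given by

(3.15) $\mathrm{Cov}(u) = \begin{pmatrix} \theta_1 I_{n_1} & 0 \\ 0 & \theta_2 I_{n_2} \end{pmatrix} \quad (\theta_1 > 0,\ \theta_2 > 0).$

By the scale invariance of the GLSE, without loss of generality, assume

(3.16) $\Sigma^{-1}(\theta) = \begin{pmatrix} I_{n_1} & 0 \\ 0 & \theta I_{n_2} \end{pmatrix}$ with $\theta = \theta_1/\theta_2$.

So the parameter space is $\Theta = \{\theta : \theta > 0\}$. Then

(3.17) $\Sigma^{-1}(\theta) = I_n + \lambda(\theta) D,$

where $\lambda(\theta) = \theta - 1$ and $D = \mathrm{diag}\{0, \ldots, 0, 1, \ldots, 1\}$ with $n_1$ zeros followed by $n_2$ ones.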


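Hence from the inequalities (2.12) in Theorem 2.1 and (2.22) in Theorem 2.2, if $\hat\theta \in \Theta$ with probability one,

(3.18) $\sigma^2 A \le R(\hat\beta(\hat\Sigma)) \le \Big\{1 + \sum_{i=1}^{2} E\big[\chi_{B_i}(\hat\theta - \theta)^2 W_i\, \tilde u_2' L'L \tilde u_2\big]\Big\}\sigma^2 A \le \Big\{1 + \delta \sum_{i=1}^{2} \big[E\big(\chi_{B_i} W_i^2 (\hat\theta - \theta)^4\big)\big]^{1/2}\Big\}\sigma^2 A,$

where $\sigma^2 = \theta_1$, $B_1 = \{\hat\theta - \theta \ge 0\}$, $B_2 = \{\hat\theta - \theta < 0\}$, $W_1 = 1$, and $W_2 = (\theta/\hat\theta)^2$. Here the expression of $\tilde u_2' L'L \tilde u_2$ in the original term is given in Remark 2.1. When $\theta = \theta_1/\theta_2 \ge 1$ is assumed so that $\Theta_1 = \{\theta \ge 1\}$, a further evaluation is given as follows.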




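THEOREM 3.2. For $\theta \ge 1$, if $\hat\theta \in \Theta_1$ with probability one,

$\sigma^2 A \le R(\hat\beta(\hat\Sigma)) \le \{1 + (1+\theta^2)\, E[(\hat\theta - \theta)^2\, \tilde u_2' L'L \tilde u_2]\}\sigma^2 A \le \{1 + (1+\theta^2)\,\delta\,[E(\hat\theta - \theta)^4]^{1/2}\}\sigma^2 A.$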


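A natural estimator for $\theta$ is proposed as follows: let

$\hat\theta_i = e'C_i e/(n_i - k) \quad (i = 1, 2)$

with

$C_1 = \mathrm{diag}\{1, \ldots, 1, 0, \ldots, 0\}$ ($n_1$ ones followed by $n_2$ zeros)

and

$C_2 = \mathrm{diag}\{0, \ldots, 0, 1, \ldots, 1\}$ ($n_1$ zeros followed by $n_2$ ones),

and let

$\hat\theta_0 = \max(1, \hat\theta_1/\hat\theta_2).$


In this case, $\hat\theta_0$ is a continuous, scale invariant, and even function of $e$. Then $\hat\beta(\hat\Sigma)$ is unbiased and the second moment of $\hat\beta(\hat\Sigma)$ exists. Therefore, $R(\hat\beta(\hat\Sigma)) = \mathrm{Cov}(\hat\beta(\hat\Sigma))$.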


As a special case, let us consider the problem of estimating the common mean of two normal populations and compare the upper bound for $R(\hat\beta(\hat\Sigma))$ in (3.18) with the exact variance (risk). The problem has been treated by many authors. Among others, Graybill and Deal (1959) raised the problem and proposed a GLS type estimator, Cohen and Sackrowitz (1974) proposed different estimators using an ancillary statistic, and Khatri and Shah (1975) presented methods for evaluating the exact variance of an estimator encountered in such a model. In our context, the model is given by $y = X\beta + u$, where $\mathrm{Cov}(u) = \theta_1\Sigma(\theta)$ is given by (3.15) and (3.16) and

$X = \mathbf{1} = (1, \ldots, 1)' : (n_1 + n_2) \times 1.$

Then a GLSE $\hat\beta(\hat\Sigma)$ here is evaluated as

(3.19) $\hat\beta(\hat\Sigma) = \left(\frac{n_1 \bar y_1}{\hat\theta_1} + \frac{n_2 \bar y_2}{\hat\theta_2}\right) \Big/ \left(\frac{n_1}{\hat\theta_1} + \frac{n_2}{\hat\theta_2}\right),$

where $\bar y_1 = \sum_{j=1}^{n_1} y_j/n_1$ and $\bar y_2 = \sum_{j=n_1+1}^{n_1+n_2} y_j/n_2$ with $y = (y_j)$, and $\hat\theta_i = \hat\theta_i(e)$ is assumed to satisfy $\hat\theta_i(-e) = \hat\theta_i(e)$ and $\hat\theta_i(ae) = \hat\theta_i(e)$ for $a > 0$ $(i = 1, 2)$. First, we shall evaluate the upper bound in (3.18). From Remark 2.1 and (3.16), it is easily shown that $\tilde u_2' L'L \tilde u_2$ is evaluated as

$\tilde u_2' L'L \tilde u_2 = n_1^2 n_2^2 (\bar y_1 - \bar y_2)^2 / \theta_1 (n_1 + n_2\theta)^3.$




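Hence from (3.18) the upper bound for the covariance matrix $R(\hat\beta(\hat\Sigma))$ is given by

(3.20) $\Big\{1 + E\Big[\frac{n_1^2 n_2^2}{\theta_1 (n_1 + n_2\theta)^3}(\hat\theta - \theta)^2 \big(\chi_{B_1} + \chi_{B_2}(\theta/\hat\theta)^2\big)(\bar y_2 - \bar y_1)^2\Big]\Big\}\theta_1 A,$

where $A = (X'\Sigma^{-1}X)^{-1} = (n_1 + n_2\theta)^{-1}$. On the other hand, the exact variance of $\hat\beta(\hat\Sigma)$ in (3.19) can be evaluated in line with Khatri and Shah (1975). In particular, if $\hat\theta_i$ is a linear combination of $s_1^2$, $s_2^2$, and $(\bar y_1 - \bar y_2)^2$, we can use their result, where $s_i^2$ is the sample unbiased variance of each population. However, even in this case it is difficult to analytically compare (3.20) with the exact variance provided by them. For this reason, we make use of Lemma 2.1 and compare them indirectly. Since from (3.18)

$\hat\beta(\hat\Sigma) - \hat\beta(\Sigma) = \frac{n_1 \bar y_1 + n_2\hat\theta \bar y_2}{n_1 + n_2\hat\theta} - \frac{n_1 \bar y_1 + n_2\theta \bar y_2}{n_1 + n_2\theta} = \frac{n_1 n_2 (\hat\theta - \theta)(\bar y_2 - \bar y_1)}{(n_1 + n_2\hat\theta)(n_1 + n_2\theta)},$

the exact variance of $\hat\beta(\hat\Sigma)$ is obtained from Lemma 2.1 as

(3.21) $\mathrm{Cov}(\hat\beta(\hat\Sigma)) = \Big\{1 + E\Big[\frac{n_1^2 n_2^2 (\hat\theta - \theta)^2 (\bar y_2 - \bar y_1)^2}{\theta_1 (n_1 + n_2\theta)(n_1 + n_2\hat\theta)^2}\Big]\Big\}\theta_1 A.$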


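Therefore the difference between the upper bound (3.20) and the exact variance (3.21) is

(3.22) $\zeta = \theta_1 A\, E\Big[\frac{n_1^2 n_2^2 (\hat\theta - \theta)^2 (\bar y_2 - \bar y_1)^2}{\theta_1 (n_1 + n_2\theta)^3}\Big\{\chi_{B_1} + \chi_{B_2}\Big(\frac{\theta}{\hat\theta}\Big)^2 - \Big(\frac{n_1 + n_2\theta}{n_1 + n_2\hat\theta}\Big)^2\Big\}\Big] = E\{n_1^2 n_2^2 (\hat\theta - \theta)^3 (\bar y_2 - \bar y_1)^2 Q\}/(n_1 + n_2\theta)^4,$

where

$Q = \chi_{B_1}\,\frac{n_2\{2n_1 + n_2(\hat\theta + \theta)\}}{(n_1 + n_2\hat\theta)^2} - \chi_{B_2}\,\frac{n_1\{2n_2\theta\hat\theta + n_1(\hat\theta + \theta)\}}{\hat\theta^2 (n_1 + n_2\hat\theta)^2}.$


If $\hat\theta - \theta = O_p(m^{-1/2})$ with $m = \min(n_1, n_2)$, then the difference $\zeta$ of the upper bound and the exact variance is of order $O(m^{-5/2})$ under regularity conditions. Hence $m^2\zeta = O(m^{-1/2})$ and thus the upper bound is asymptotically equivalent to the exact variance up to $O(m^{-2})$. When $\hat\theta_i$ is independent of $(\bar y_2 - \bar y_1)$, as is the case for $\hat\theta_i = s_i^2$ $(i = 1, 2)$, $(\bar y_2 - \bar y_1)^2$ can be replaced by $E(\bar y_2 - \bar y_1)^2 = (n_1 + n_2\theta)\theta_2/n_1 n_2$ in (3.20), (3.21), and (3.22). If $\hat\theta_i = s_i^2$, using $(n_1 - 1)s_1^2/(n_2 - 1)s_2^2 = \theta v/(1 - v)$ with $v$ being distributed as beta$((n_1 - 1)/2, (n_2 - 1)/2)$, we can also evaluate (3.21) exactly if necessary.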
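
The algebra behind the comparison of the bound with the exact variance can be checked with exact rational arithmetic. The sketch below is illustrative only: it assumes the common-mean GLSE written in terms of the ratio $\theta$, i.e. $(n_1\bar y_1 + n_2\theta\bar y_2)/(n_1 + n_2\theta)$, and uses hypothetical sample sizes and means. It verifies the closed form for the difference of the two estimators, and that the weight in the bound is never smaller than the weight in the exact variance.

```python
from fractions import Fraction as F

def glse(n1, n2, ybar1, ybar2, theta):
    """Common-mean GLSE written with the variance ratio theta = theta_1/theta_2."""
    return (n1 * ybar1 + n2 * theta * ybar2) / (n1 + n2 * theta)

def closed_form_difference(n1, n2, ybar1, ybar2, theta_hat, theta):
    """Closed form for glse(theta_hat) - glse(theta)."""
    return (n1 * n2 * (theta_hat - theta) * (ybar2 - ybar1)
            / ((n1 + n2 * theta_hat) * (n1 + n2 * theta)))

def bound_weight(theta_hat, theta):
    """Indicator-weighted factor 1 on {theta_hat >= theta}, (theta/theta_hat)^2 otherwise."""
    return F(1) if theta_hat >= theta else (theta / theta_hat) ** 2

def exact_weight(n1, n2, theta_hat, theta):
    """Factor ((n1 + n2 theta)/(n1 + n2 theta_hat))^2 appearing in the exact variance."""
    return ((n1 + n2 * theta) / (n1 + n2 * theta_hat)) ** 2

# Hypothetical values, for illustration only.
n1, n2 = 7, 5
ybar1, ybar2 = F(3, 2), F(9, 4)
theta = F(2)
candidates = [F(1, 2), F(2), F(7, 3), F(10)]

for th in candidates:
    lhs = glse(n1, n2, ybar1, ybar2, th) - glse(n1, n2, ybar1, ybar2, theta)
    print(lhs == closed_form_difference(n1, n2, ybar1, ybar2, th, theta),
          bound_weight(th, theta) >= exact_weight(n1, n2, th, theta))
```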




Acknowledgments. The authors are deeply grateful to the referees for their 
invaluable suggestions and comments. Portions of Kariya’s work were supported 
by the Ministry of Education of Japan under General C59540104. 


REFERENCES 


BASU, D. (1955). On statistics independent of a complete sufficient statistic. Sankhya 15 377-380.

COHEN, A. and SACKROWITZ, H. B. (1974). On estimating the common mean of two normal
populations. Ann. Statist. 2 1274-1282.

DURBIN, J. and WATSON, G. S. (1971). Testing for serial correlation in least squares regression III.
Biometrika 58 1-19.

GRAYBILL, F. A. and DEAL, R. B. (1959). Combining unbiased estimators. Biometrics 15 543-550.

KARIYA, T. (1981). Bounds for the covariance matrices of Zellner's estimator in the SUR model and
2SAE in a heteroscedastic model. J. Amer. Statist. Assoc. 76 975-979.

KARIYA, T. and TOYOOKA, Y. (1985). Nonlinear versions of the Gauss-Markov theorem and GLSE.
In Multivariate Analysis (P. R. Krishnaiah, ed.) 6 345-354. Elsevier, New York.

KHATRI, C. G. and SHAH, K. R. (1975). Exact variance of combined inter- and intra-block estimates
in incomplete block designs. J. Amer. Statist. Assoc. 70 402-406.

MCELROY, F. W. (1967). A necessary and sufficient condition that ordinary least-squares estimators
be best linear unbiased. J. Amer. Statist. Assoc. 62 1302-1304.

RAO, C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to
measurement of signals. Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1 355-372.
Univ. California Press.

THEIL, H. (1971). Principles of Econometrics. Wiley, New York.

TOYOOKA, Y. and KARIYA, T. (1983). Uniform bounds for approximations to the pdf's and cdf's of
GLSP and GLSE. Discussion paper 84, Hitotsubashi Univ.


DEPARTMENT OF APPLIED MATHEMATICS          INSTITUTE OF ECONOMIC RESEARCH
OSAKA UNIVERSITY                           HITOTSUBASHI UNIVERSITY
TOYONAKA-SHI, OSAKA 560                    KUNITACHI-SHI, TOKYO 186
JAPAN                                      JAPAN


The Annals of Statistics
1986, Vol. 14, No. 2, 691-707


INTERGROUP DIVERSITY AND CONCORDANCE FOR 
RANKING DATA: AN APPROACH VIA METRICS FOR 
PERMUTATIONS 


By PAUL D. FEIGIN¹ AND MAYER ALVO²
Technion and University of Ottawa 


Motivated by the apportionment of diversity analysis due to C. R. Rao, a 
general approach to comparing populations of rankers is proposed. Each 
permutation metric corresponds to a particular population characteristic that 
forms the basis of the comparison. Tests of hypotheses concerning equality of 
characteristics are developed. Throughout, comparison is made with earlier 
work, most of which is based on the use of only the Spearman metric. 
Extension to tied rankings is discussed. Examples for two groups are pre- 
sented which illustrate the computational feasibility as well as the value of 
the proposed procedures. 


1. Introduction. There has recently been a revival of interest in the analysis 
of rank data (or rankings) and even some controversy concerning the appropriate- 
ness of various proposed statistics [see Hollander and Sethuraman (1978) plus 
comments by Schucany, and the paper by Kraemer (1981)]. Our aim here is to 
propose a general framework within which the various analyses can be investigated
as well as to suggest some alternative analyses. We will concentrate on the
situation of two or more populations of rankers or judges [for the single popula- 
tion case some related material may be found in Alvo, Cabilio, and Feigin (1982) ], 
and our interest lies in determining whether and how the populations differ in 
the way they tend to rank a fixed set of r objects. 

Although the rankers may apply some absolute scoring system to each object, 
with the resultant scores determining the ranking, these scores are not observed;
the only datum each ranker provides is his ordering of the r objects with
respect to some criterion. Various models have been proposed for such data, for 
example: ones that are based on an absolute scoring process [see the recent paper 
by Pettitt (1982) which also considers groups of judges]; or ones that try to 
reflect the paired comparisons approach to determining a ranking [see Mallows 
(1957) and Feigin and Cohen (1978)]. Here, we will not consider parametric 
models per se, but will be concerned with describing these characteristics of 
models with respect to which one may wish to compare different groups of 
judges. 

The model or population characteristics were arrived at initially by applying 
methods of analysing diversity which have recently been developed by Rao 


Received August 1984; revised May 1985. 

¹ Research supported in part by VPR Fund—Lawrence Deutsch Research Fund.

² Research supported in part by NSERC Grant No. 9068.

AMS 1980 subject classifications. Primary 62G10, 62G05; secondary 62J10, 20B99. 

Key words and phrases. Intergroup concordance, Kendall’s tau, Spearman’s rho, Spearman's 
footrule, tied rankings. 




(1982a, 1982b). Although, in retrospect, one may also begin by defining the 
various characteristics directly, we feel that it is nevertheless instructive 
and useful to present the analysis of diversity as the point of departure (viz. 
Section 2). 

Before proceeding with a general comparison of various approaches to analysing rankings, we pause to define a few quantities. A ranking of $r$ objects can be described by a permutation, $\omega$, which is an element of $\Omega$, the set of all $k = r!$ permutations (or the symmetric group on $r$ letters). The possibility of rankings with ties is discussed briefly later; meanwhile it is excluded. A stochastic model for the selection of a ranking by a judge is a probability vector $\pi$, which is of dimension $k$ and whose $i$th component $\pi(i)$ is the probability that $\omega_i$ ($\in \Omega$) is selected by the judge. The permutations $\omega_1, \ldots, \omega_k$ are numbered in an arbitrary but fixed way. Each judge in a particular group (or population) selects a ranking according to the same $\pi$ and independently of the other judges. Equivalently, we can regard the population distribution of judges (as far as ranking the $r$ objects is concerned) as given by $\pi$, and the particular group of judges as a random sample from this infinite population. Comparing $g$ populations of judges then amounts to trying to compare the $g$ probability vectors (models) $\pi_1, \pi_2, \ldots, \pi_g$. If a sample of $n_i$ judges from population $i$ is taken we denote by $\{X_{ij},\ j = 1, \ldots, n_i\}$ the set of rankings chosen by these judges. Summarizing the above, we have, for each $j = 1, \ldots, n_i$,


$P(X_{ij} = \omega_l) = \pi_i(l), \qquad l = 1, \ldots, k.$


When referring to the comparison of groups of judges the terms agreement 
and intergroup concordance have been used in the literature. The question of 
comparing populations of judges then involves two stages: firstly, defining agree- 
ment or measuring degrees of agreement; and secondly, estimating or testing for 
levels of agreements based on the samples (or groups) of judges available. 

The recent literature on agreement has focussed on analysis based on the 
average rank statistic for each group, whether or not agreement itself is defined 
in terms of the expected rank vector. One of the innovations proposed here is the 
possibility of considering other statistics corresponding to other characteristics of 
the population models. For example, the average rank vector is directly related to the Spearman rank correlation, $\rho$, whereas one may wish to consider that characteristic vector associated with Kendall's rank correlation, $\tau$.

The characteristics of a population or model (determined by $\pi$) may be thought of as a mapping $T\pi$ of $\pi$ into a lower dimensional Euclidean space $\mathbb{R}^s$ ($s < k = r!$); we only consider linear maps $T$ here. We note that Pettitt's (1982) approach, for example, may be regarded as involving a particular nonlinear map
of $\pi$ vectors which lie in an $(r-1)$-dimensional subspace of $\mathbb{R}^k$. Given $T$, agreement itself may be defined in terms of the vectors $T\pi_1, \ldots, T\pi_g$ [see, e.g., Kraemer (1981)] and/or the means of testing for agreement, however defined, may be based on the statistics $Tf_1, \ldots, Tf_g$, where $f_i$ is the relative frequency vector of the permutations (rankings) in group $i$ of judges.

One of the main reasons for considering the statistic $Tf$ instead of $f$ itself is the dimensionality of the latter. For $r$ as small as 5, the analysis based directly on $f$ would need to take place on the unit simplex in $\mathbb{R}^{120}$, a task that would require more data than is usually available in order to apply standard asymptotics. Another important motivation for considering characteristics $T\pi$ is that permutations or rankings that differ do so to different degrees. In other words, one would like to incorporate the notion of distance between permutations when comparing groups of judges. This last point leads us to our approach via measures of diversity and forms the subject of the next section.

In terms of the notation of this paper, the controversy alluded to earlier may be paraphrased as follows. Hollander and Sethuraman (1978) define agreement between two populations of rankers as $\pi_1 = \pi_2$, and suggest testing agreement by comparing $Tf_1$ with $Tf_2$, where $T$ is the mapping which gives the average (centred) ranks vector. Schucany and Frawley (1973), for example, regard complete absence of agreement as nonpositive correlation between the vectors $T\pi_1$ and $T\pi_2$ (same $T$ as before). They are also concerned with the idea that if $\pi_1$ or, actually, $T\pi_1$ is a constant vector then there is no consensus in population 1 and so no possibility for agreement between populations 1 and 2. Kraemer (1981) develops this idea further by defining a relative measure of intergroup concordance, relative to the average intragroup concordance. We will make some more explicit references to these approaches in the sequel, although our results, to some extent, develop along the lines proposed by Hollander and Sethuraman's (1978) analysis, but deal with a larger variety of statistics. Of particular interest is the characteristic corresponding to Kendall's $\tau$, which forms the basis of the numerical examples presented in Section 5.

As the referee has pointed out to us, in his forthcoming monograph, Diaconis 
(1985) also presents one way to use metrics on the permutation group in order to 
test for agreement among populations of judges. His approach is based on 
constructing a minimal spanning tree of the data and comparing the number of 
edges joining rankings of different populations with the number of edges which 
join rankings from the same population. We hope to compare this rather different 
approach with ours in the future. 


2. Measuring and apportioning diversity for populations of rankers. 
The measurement of diversity of a population has a long history, particularly in 
the biological sciences such as genetics. Coincidentally with the renewed interest 
in rank data, statisticians have recently returned to the problem of measuring 
diversity [see, for example, the paper (with comments) by Patil and Taillie 
(1982)] and one approach, espoused by Rao (1982a, 1982b), forms the basis of our 
analysis. 

In contrast to the entropy-type measures of diversity which are simply functions of $\{\pi(1), \pi(2), \ldots, \pi(k)\}$ without any regard to the ordering of the categories (in our case, permutations), Rao suggests using measures which incorporate distances between categories. In light of the comments made earlier, this suggestion seems eminently appropriate to the analysis of rankings.

Consider a set $\Omega$ of $k$ points $\omega_1, \omega_2, \ldots, \omega_k$ and let $\{\delta_{ij} : i, j = 1, \ldots, k\}$ denote the set of "distances" between pairs of points, i.e., $\delta_{ij}$ is the distance between $\omega_i$ and $\omega_j$. In Rao's (1982a) formulation, $\delta$ need only be nonnegative and need not satisfy the properties of a metric; in other words, $\delta$ measures some concept of distance between points. The diversity coefficient of a population can now be defined in terms of the expected distance between two members selected independently from the population according to $\pi$:


DEFINITION 2.1. The diversity coefficient (based on $\delta$) of the population with probability vector $\pi' = (\pi(1), \ldots, \pi(k))$ on $\Omega$ is

(2.1) $H(\pi) = \pi'\Delta\pi$

where $\Delta = (\delta_{ij})$ is a $(k \times k)$ matrix.


Applying this definition to the rankings situation simply involves choosing a measure of distance between permutations ($\mu, \eta \in \Omega$ say). Three such measures used in statistical applications are given below and are related to Spearman's $\rho$ ($S$), Kendall's $\tau$ ($K$) and Spearman's footrule ($F$), respectively:


(2.2) $d_S(\mu, \eta) = \frac{1}{2}\sum_{s=1}^{r} [\mu(s) - \eta(s)]^2 = r(r+1)(2r+1)/6 - \mu'\eta;$


(2.3) $d_K(\mu, \eta) = \sum_{s<t} \{1 - \mathrm{sgn}[\mu(s) - \mu(t)]\,\mathrm{sgn}[\eta(s) - \eta(t)]\};$


(2.4) $d_F(\mu, \eta) = \frac{1}{2}\sum_{s=1}^{r} |\mu(s) - \eta(s)|.$


Of course, many other metrics have been defined on the set of permutations and 
the particular one that the statistician chooses to use may involve consideration 
of the actual processes which determine the choice of a ranking by a judge. 
Alternatively, more robust conclusions may be reached by taking into account 
several different metrics when analysing the same set of data. It is this flexibility 
which we wish to pursue here, and not the determination of a particular metric as 
universally superior. 
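
The three distances (2.2) to (2.4) are easy to compute directly. The following is a minimal sketch, assuming a ranking is stored as a tuple whose entry $s$ is the rank given to object $s$; the function names are illustrative and not from the paper.

```python
from itertools import combinations

def sign(x):
    """Sign of x as -1, 0 or +1."""
    return (x > 0) - (x < 0)

def d_spearman(mu, eta):
    """(2.2): half the squared Euclidean distance between rank vectors."""
    return sum((m - e) ** 2 for m, e in zip(mu, eta)) / 2

def d_kendall(mu, eta):
    """(2.3): each discordant pair of objects contributes 2, each concordant pair 0."""
    return sum(1 - sign(mu[s] - mu[t]) * sign(eta[s] - eta[t])
               for s, t in combinations(range(len(mu)), 2))

def d_footrule(mu, eta):
    """(2.4): half the sum of absolute rank differences."""
    return sum(abs(m - e) for m, e in zip(mu, eta)) / 2

mu, eta = (1, 2, 3, 4), (2, 1, 4, 3)
print(d_spearman(mu, eta), d_kendall(mu, eta), d_footrule(mu, eta))  # prints: 2.0 4 2.0
```

Relabelling the objects in both rankings by the same permutation leaves all three values unchanged, which is the right invariance property used later in Section 2.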

Thinking of diversity as a generalization of the notion of variance, and since we are interested in comparing several populations, the next step is defining the diversity between populations versus that within populations or, as Rao (1982a) calls it, the apportionment of diversity. We note here that $d_S$ is the square of a Euclidean metric and so leads directly to a standard analysis of variance. This fact gives further insight into the popularity of analyses based on Spearman's $\rho$. Although $d_S$ is itself not a metric, we nevertheless will refer to it as the Spearman metric in the sequel.

Suppose $g$ populations with probability vectors $\pi_1, \ldots, \pi_g$ are mixed together according to the proportions $\lambda_1, \ldots, \lambda_g$ such that $\lambda_1 + \lambda_2 + \cdots + \lambda_g = 1$, thereby forming a new population with probability vector $\pi = \sum_{i=1}^{g} \lambda_i \pi_i$. Following Rao (1982a) we now turn to:


DEFINITION 2.2. Suppose the diversity $H(\cdot)$ of (2.1) is a concave function on $S_k$, the unit simplex in $\mathbb{R}^k$. Then the total diversity $H(\pi)$ can be apportioned into the within populations diversity

(2.5) $\sum_{i=1}^{g} \lambda_i H(\pi_i)$

and the between populations diversity

(2.6) $-\sum_{i<j} \lambda_i \lambda_j (\pi_i - \pi_j)'\Delta(\pi_i - \pi_j).$


The concavity requirement is to ensure that the between populations diversity (or discrimination coefficient) is nonnegative. In terms of the distance $\delta$, this condition is equivalent to the requirement that

(2.7) $a'\Delta a \le 0$ whenever $\sum_{s=1}^{k} a(s) = 0$

or equivalently, that

(2.8) $\Delta^* = (\delta_{ik} + \delta_{jk} - \delta_{ij})$ be nonnegative definite.


It is interesting to note that requiring that $\delta$ be a metric on $\Omega$ is not sufficient to ensure (2.7) or (2.8) hold, so that for the application to rankings one has to verify (2.7), say, for each potential distance measure.
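
The apportionment of Definition 2.2 can be verified on a small case with exact arithmetic. The sketch below uses the Spearman distance on $r = 3$ objects and two hypothetical populations of judges; the probability vectors and mixing proportions are made-up illustrative values. It checks that the total diversity equals the within part plus the between part, and that the between part is nonnegative.

```python
from fractions import Fraction
from itertools import permutations

def d_spearman(mu, eta):
    """Spearman distance: half the squared Euclidean distance between rank vectors."""
    return Fraction(sum((m - e) ** 2 for m, e in zip(mu, eta)), 2)

def quad(a, delta, b):
    """Bilinear form a' Delta b."""
    k = len(a)
    return sum(a[i] * delta[i][j] * b[j] for i in range(k) for j in range(k))

omega = list(permutations((1, 2, 3)))          # arbitrary but fixed numbering
delta = [[d_spearman(a, b) for b in omega] for a in omega]

# Two hypothetical populations and mixing proportions (illustrative values only).
pi1 = [Fraction(x, 10) for x in (5, 2, 1, 1, 1, 0)]
pi2 = [Fraction(x, 10) for x in (0, 1, 1, 1, 2, 5)]
lam = [Fraction(1, 3), Fraction(2, 3)]

mix = [lam[0] * p + lam[1] * q for p, q in zip(pi1, pi2)]
total = quad(mix, delta, mix)
within = lam[0] * quad(pi1, delta, pi1) + lam[1] * quad(pi2, delta, pi2)
diff = [p - q for p, q in zip(pi1, pi2)]
between = -lam[0] * lam[1] * quad(diff, delta, diff)

print(total == within + between, between >= 0)
```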

For metrics on the set of permutations, we have that the desirable property of right invariance [see Diaconis and Graham (1977) or Alvo et al. (1982)] ensures that the $k$ vector $e = (1, 1, \ldots, 1)'$ is an eigenvector of $\Delta$. We therefore have:


LEMMA 2.1. If $\delta$ is a right invariant metric on the set of permutations then there exists $c > 0$ such that

$\Delta e = ce$

and $H(\cdot) = H_\Delta(\cdot)$ is concave if and only if

(2.9) $Q = (c/k)J - \Delta$ is positive semidefinite,

where $J = ee'$. Moreover, in this case $H(\pi)$ has the maximum value $\beta = c/k$ at $\pi = u = (1/k)e$.


PROOF. The existence of the eigenvalue $c$ follows from the right invariance property as referred to above.

If (2.7) holds, then since $Qe = 0$, $x'Qx \ge 0$ for any $x = b + ae$ with $b'e = 0$; but this includes all $x \in \mathbb{R}^k$. The converse is immediate.

Writing $\pi = u + (\pi - u)$ and since $u$ is an eigenvector of $\Delta$ orthogonal to $(\pi - u)$ the result follows from

$H(\pi) = u'\Delta u + (\pi - u)'\Delta(\pi - u) \le u'\Delta u = c/k.$ □


This last result says that the uniform distribution over Q is most diverse for 
diversity measures based on a right invariant metric. 
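
As an illustration of Lemma 2.1, the sketch below builds the Kendall distance matrix for $r = 3$, confirms that all row sums equal one constant $c$ (so that $e$ is an eigenvector), and compares the diversity of the uniform distribution with that of a hypothetical non-uniform one. Names and the example vector are illustrative, not from the paper.

```python
from fractions import Fraction
from itertools import combinations, permutations

def sign(x):
    return (x > 0) - (x < 0)

def d_kendall(mu, eta):
    """Kendall distance (2.3)."""
    return sum(1 - sign(mu[s] - mu[t]) * sign(eta[s] - eta[t])
               for s, t in combinations(range(len(mu)), 2))

def diversity(pi, delta):
    """H(pi) = pi' Delta pi of (2.1)."""
    k = len(pi)
    return sum(pi[i] * delta[i][j] * pi[j] for i in range(k) for j in range(k))

omega = list(permutations((1, 2, 3)))
delta = [[Fraction(d_kendall(a, b)) for b in omega] for a in omega]
k = len(omega)

row_sums = {sum(row) for row in delta}   # one value only when Delta e = c e
c = max(row_sums)
uniform = [Fraction(1, k)] * k
skewed = [Fraction(1, 2), Fraction(1, 2)] + [Fraction(0)] * (k - 2)   # hypothetical

print(len(row_sums) == 1, diversity(uniform, delta) == c / k,
      diversity(skewed, delta) <= c / k)
```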

The fact that (2.9) is valid for $\Delta$ based on $d_S$ and $d_K$ [see (2.2) and (2.3)] follows from Alvo et al. (1982) or from the form of (2.2) and (2.3) (see Lemmas 3.1 and 3.3). That (2.9) is also true for the footrule metric (2.4) is less obvious and is proved in Lemma 3.4. Note that the restriction to right invariant metrics is a natural one in the context of rankings.


If the rank of $Q$ is less than or equal to $s$, then standard matrix theory implies the existence of an $(s \times k)$ matrix $T$ such that


(2.10) $Q = T'T = (c/k)J - \Delta.$


We can now gain further insight into the nature of the between groups diversity. For $X$ a random vector we let $\mathrm{var}(X)$ denote its variance covariance matrix.


LEMMA 2.2. If $\delta$ is a right invariant metric on the set of permutations $\Omega$, then for $H$ defined by (2.1), the between populations diversity is given by

(2.11) $\sum_{i<j} \lambda_i \lambda_j \|T\pi_i - T\pi_j\|_{(s)}^2 = \mathrm{tr}\{\mathrm{var}(T\pi_I)\}$

where $\|\cdot\|_{(s)}$ is the Euclidean norm in $\mathbb{R}^s$ and $I$ has the distribution $P(I = i) = \lambda_i$, $i = 1, \ldots, g$.


PROOF. The result follows immediately from (2.6) by substituting $\Delta = (c/k)J - T'T$ and expanding the $\|\cdot\|^2$ term. □


The expression (2.11) allows us to interpret the apportionment of diversity based on $\delta$ in terms of the variability of $T\pi_i$, $i = 1, \ldots, g$. Moreover, it is the characteristic $T\pi$ of the model $\pi$ which forms the basis for comparing populations if one does so using a diversity measure based on a right invariant metric $\delta$. This interpretation of the apportionment of diversity is the topic of the next section.


3. Defining and interpreting model characteristics. We pursue the im- 
plications of Rao’s apportionment of diversity for the analysis of rankings based 
on right invariant metrics. In so doing we arrive at a way of interpreting T 
matrices for given metrics as well as showing how the characteristics so defined 
have been or may be used to compare populations of judges. We concentrate on 
describing the T matrices corresponding to the Spearman, Kendall, and footrule 
metrics. 

For the case $\delta_{ij} = d_S(\omega_i, \omega_j)$, we may identify $T$ as follows. Define the centered rank vector

$t_S: \Omega \to \mathbb{R}^r$

by

(3.1) $t_S(\omega) = \left(\omega(1) - \frac{r+1}{2}, \ldots, \omega(r) - \frac{r+1}{2}\right)'$

and let the $(r \times k)$ matrix $T_S$ be defined by

(3.2) $T_S = (t_S(\omega_1), \ldots, t_S(\omega_k)).$


LEMMA 3.1. For $\delta$ corresponding to $d_S$, the matrix $T$ in the decomposition (2.10) is given by $T_S$ of (3.2), and the characteristic $T_S\pi$ corresponds to the expected centred rank vector.




PROOF. Let the rank vector $v: \Omega \to \mathbb{R}^r$ be defined as $v(\omega) = t_S(\omega) + \frac{r+1}{2}e$ and

(3.3) $V = (v(\omega_1), \ldots, v(\omega_k)).$

Then

(3.4) $T = T_S = \left(I - \frac{1}{r}J\right)V$

and

(3.5) $T'T = \left(V'V - \tfrac{1}{4}r(r+1)^2 J\right) = (1/12)\,r(r+1)(r-1)J - \Delta,$

the last equality following from the definition of $d_S$ (2.2). Since, by construction, $Te = 0$, we conclude that we have discovered the decomposition (2.10) with $c = kr(r+1)(r-1)/12$.

Furthermore,

$T\pi = \sum_{i=1}^{k} \pi(i)\,t_S(\omega_i) = E_\pi t_S(\omega),$

the expected centred rank vector. □
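
The decomposition in Lemma 3.1 can be confirmed by brute force for small $r$. The sketch below, with illustrative function names, enumerates all permutations, forms the centred rank vectors, and checks entry by entry that $(T'T)_{ab} = r(r+1)(r-1)/12 - d_S(\omega_a, \omega_b)$, together with $Te = 0$.

```python
from fractions import Fraction
from itertools import permutations

def centred_ranks(r):
    """All permutations of 1..r and their centred rank vectors (columns of T_S)."""
    omega = list(permutations(range(1, r + 1)))
    centre = Fraction(r + 1, 2)
    return omega, [[w - centre for w in om] for om in omega]

def spearman_decomposition_holds(r):
    """Check (T'T)_{ab} = beta - d_S(a, b) for every pair of permutations."""
    omega, t_s = centred_ranks(r)
    beta = Fraction(r * (r + 1) * (r - 1), 12)
    for a, ta in zip(omega, t_s):
        for b, tb in zip(omega, t_s):
            gram = sum(x * y for x, y in zip(ta, tb))
            d_s = Fraction(sum((x - y) ** 2 for x, y in zip(a, b)), 2)
            if gram != beta - d_s:
                return False
    return True

def columns_sum_to_zero(r):
    """Check T e = 0: the centred rank vectors sum to the zero vector."""
    _, t_s = centred_ranks(r)
    return all(sum(col[i] for col in t_s) == 0 for i in range(r))

print(spearman_decomposition_holds(4), columns_sum_to_zero(4))
```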


Much of the literature on analysing rankings is based upon the expected rank vectors. For the problem of comparing two populations of rankers, Schucany and Frawley (1973) and Li and Schucany (1975) consider a statistic which may be regarded as an estimator of $(T\pi_1)'(T\pi_2)$: a measure of "covariance" between the two vectors $T\pi_1$ and $T\pi_2$. Kraemer (1981) defines a measure of intergroup concordance $\rho$ which, in our notation, is given by


(3.6) $\rho = \|V(\pi - u)\|^2 \Big/ \sum_i \lambda_i \|V(\pi_i - u)\|^2,$


where $\pi = \sum_i \lambda_i \pi_i$ and $u = (1/k)e$ as before. Kraemer considers the case $\lambda_i = 1/g$; $i = 1, 2, \ldots, g$. We can rewrite $\rho$ in terms of the within and total diversity.


LEMMA 3.2. In terms of the apportionment of diversity (Definition 2.2) based on the Spearman metric, the intergroup concordance $\rho$ is given by

(3.7) $\rho = \|T\pi\|^2 \Big/ \Big\{\sum_i \lambda_i \|T\pi_i\|^2\Big\}$

(3.8) $\quad = (\beta - \text{total})/(\beta - \text{within}), \qquad \beta = c/k = r(r+1)(r-1)/12.$


PROOF. From (3.3) and (3.4) we have

$V(\pi - u) = T(\pi - u) = T\pi$

so that, via (2.9) and (2.10),

$\|T\pi\|^2 = \pi'Q\pi = (c/k) - \pi'\Delta\pi = \beta - H(\pi)$

and (3.7) and (3.8) follow. □




Note that if $\pi_i = u$ for all $i$ then $\rho$ is undefined; otherwise it represents the proportion of concordance attainable between groups given the level within (Kraemer, 1981). In fact, we may regard concordance ($C$) as the complement of diversity ($D$) from the relationship

(3.9) $C = \beta - D.$

Thus, Kraemer's coefficient measures the concordance ratio

$\rho = C(\text{total})/C(\text{within})$

whereas the apportionment of diversity would lead to the dispersion ratio

$\alpha = D(\text{within})/D(\text{total}).$


It then becomes a more philosophical issue whether dispersion (distance) or concordance (similarity) is the appropriate criterion. It is true that whenever $C(\text{within}) = 0$, $\rho$ is undefined whereas $\alpha = 1$. This case corresponds to that of similar but completely internally discordant groups. Is it the similarity or is it the complete discordance that one wants to measure?
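
To make the contrast between the two ratios concrete, the sketch below evaluates both for the Spearman distance on $r = 3$ objects. The two scenarios (two groups each unanimous on opposite rankings, and two identical non-degenerate groups) are hypothetical examples chosen for illustration; the function names are not from the paper.

```python
from fractions import Fraction
from itertools import permutations

r = 3
omega = list(permutations(range(1, r + 1)))
beta = Fraction(r * (r + 1) * (r - 1), 12)

def diversity(pi):
    """H(pi) under the Spearman distance (2.2)."""
    total = Fraction(0)
    for a, p in zip(omega, pi):
        for b, q in zip(omega, pi):
            total += p * q * Fraction(sum((x - y) ** 2 for x, y in zip(a, b)), 2)
    return total

def ratios(populations, lam):
    """Concordance ratio rho of (3.8) and dispersion ratio alpha = within/total."""
    mix = [sum(l * pi[i] for l, pi in zip(lam, populations)) for i in range(len(omega))]
    total = diversity(mix)
    within = sum(l * diversity(pi) for l, pi in zip(lam, populations))
    return (beta - total) / (beta - within), within / total

half = [Fraction(1, 2), Fraction(1, 2)]
unanimous_123 = [1, 0, 0, 0, 0, 0]      # every judge chooses (1, 2, 3)
unanimous_321 = [0, 0, 0, 0, 0, 1]      # every judge chooses (3, 2, 1)
split = [Fraction(1, 2), Fraction(1, 2), 0, 0, 0, 0]

print(ratios([unanimous_123, unanimous_321], half))
print(ratios([split, split], half))
```

In the first scenario both ratios are 0 (all diversity lies between the groups); in the second both equal 1 (no diversity lies between the groups).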

The form (3.7) could, of course, also serve to define a measure of intergroup 
concordance based on another right invariant metric, for example, that based on 
Kendall’s tau [viz. (2.3)]. In order to interpret the latter we quote the following 
result. 


LEMMA 3.3. Let $t_K: \Omega \to \{-1, +1\}^{\binom{r}{2}}$ be defined by

(3.10) $(t_K(\omega))(s) = \mathrm{sgn}\{\omega(j) - \omega(i)\}, \qquad s = 1, 2, \ldots, \binom{r}{2},$

where

$s = (i-1)(r - i/2) + (j - i), \qquad 1 \le i < j \le r.$

Then the $\binom{r}{2} \times k$ matrix $T$

$T = T_K = (t_K(\omega_1), \ldots, t_K(\omega_k))$

satisfies (2.10) for $\Delta$ based on the Kendall tau metric (2.3).


PROOF. Straightforward, since for the definition (2.3)

$d_K(\mu, \eta) = \frac{r(r-1)}{2} - t_K(\mu)'\,t_K(\eta),$

so that

$\Delta = [r(r-1)/2]J - T'T.$ □
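
The indexing of pairs and the decomposition in Lemma 3.3 can both be checked by enumeration. This is an illustrative sketch with assumed function names: it confirms that the index $s$ runs through $1, \ldots, \binom{r}{2}$ in lexicographic order of $(i, j)$, and that $d_K(\mu, \eta) = r(r-1)/2 - t_K(\mu)'t_K(\eta)$ for every pair of permutations.

```python
from fractions import Fraction
from itertools import combinations, permutations

def sign(x):
    return (x > 0) - (x < 0)

def pair_index(i, j, r):
    """Position s of the pair (i, j), 1 <= i < j <= r, as given in Lemma 3.3."""
    return (i - 1) * (r - Fraction(i, 2)) + (j - i)

def t_kendall(omega):
    """Pairwise sign vector t_K(omega), filled in using pair_index."""
    r = len(omega)
    t = [0] * (r * (r - 1) // 2)
    for i in range(1, r):
        for j in range(i + 1, r + 1):
            t[int(pair_index(i, j, r)) - 1] = sign(omega[j - 1] - omega[i - 1])
    return t

def d_kendall(mu, eta):
    """Kendall distance (2.3)."""
    return sum(1 - sign(mu[s] - mu[t]) * sign(eta[s] - eta[t])
               for s, t in combinations(range(len(mu)), 2))

def kendall_decomposition_holds(r):
    omega = list(permutations(range(1, r + 1)))
    half = r * (r - 1) // 2
    return all(d_kendall(a, b) == half - sum(x * y for x, y in zip(t_kendall(a), t_kendall(b)))
               for a in omega for b in omega)

print([int(pair_index(i, j, 4)) for i in range(1, 4) for j in range(i + 1, 5)])
print(kendall_decomposition_holds(4))
```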


The characteristic $T_K\pi$ may therefore be regarded as the expected pairwise concordance vector, where concordance is measured with respect to the identity permutation $(1, 2, \ldots, r)$. Thus the measure $\rho$ of intergroup concordance with $T = T_K$ in (3.6) would look at relative agreement based on average pairwise decisions, a more sensitive criterion than that based on average ranks. Whether or not this extra sensitivity reflects relevant aspects of concordance or agreement between judges must, of course, be ascertained from the particular context. At the very least, however, it provides a further tool for comparisons and contrasts.

In order to interpret the characteristic $T_F\pi$ corresponding to the diversity measure based on the Spearman footrule metric (2.4), we obtain the following result after some algebra.


LEMMA 3.4. Let $t_F: \Omega \to \mathbb{R}^{r^2}$ be defined by

$(t_F(\omega))((i-1)r + j) = I[\omega(i) \le j] - j/r, \qquad 1 \le i, j \le r,$

where $I[\omega(i) \le j]$ equals one when $\omega(i) \le j$ and is zero otherwise. Then the $(r^2 \times k)$ matrix $T$ given by

$T = T_F = (t_F(\omega_1), \ldots, t_F(\omega_k))$

satisfies (2.10) for $\Delta$ based on Spearman's footrule.


PROOF. The proof amounts to showing that

$[t_F(\mu)]'\,t_F(\eta) = (r+1)(r-1)/6 - d_F(\mu, \eta)$

using the fact that

$\max(\mu(i), \eta(i)) = \tfrac{1}{2}[\eta(i) + \mu(i)] + \tfrac{1}{2}|\mu(i) - \eta(i)|.$ □
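
The identity in the proof of Lemma 3.4 can likewise be checked exhaustively for small $r$. The following sketch, with illustrative names, builds the vector of centred indicator functions and compares the inner products with $(r+1)(r-1)/6 - d_F$ using exact fractions.

```python
from fractions import Fraction
from itertools import permutations

def t_footrule(omega):
    """Entries I[omega(i) <= j] - j/r for i, j = 1..r, ordered as (i - 1) r + j."""
    r = len(omega)
    return [(1 if omega[i] <= j else 0) - Fraction(j, r)
            for i in range(r) for j in range(1, r + 1)]

def footrule_decomposition_holds(r):
    """Check t_F(a)' t_F(b) = (r + 1)(r - 1)/6 - d_F(a, b) for all pairs."""
    beta = Fraction((r + 1) * (r - 1), 6)
    omega = list(permutations(range(1, r + 1)))
    vectors = [t_footrule(w) for w in omega]
    for a, ta in zip(omega, vectors):
        for b, tb in zip(omega, vectors):
            gram = sum(x * y for x, y in zip(ta, tb))
            d_f = Fraction(sum(abs(x - y) for x, y in zip(a, b)), 2)
            if gram != beta - d_f:
                return False
    return True

print(footrule_decomposition_holds(3), footrule_decomposition_holds(4))
```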


The characteristic $T_F\pi$ may therefore be regarded as the set of (centred) distribution functions for the ranks of each item. If we write

$F_i(j) = P(\omega(i) \le j) - j/r, \qquad 1 \le j \le r,$

then $T_F\pi$ is equivalent to the set $\{F_1, \ldots, F_r\}$. This characterization is only based on the marginal distributions of the ranks for each item, and so takes no account of possibly important dependence relationships between these ranks. The same is of course true for $T_S\pi$, which merely considers averages for each item, whereas $T_K\pi$ is sensitive to certain patterns of dependence in the allocation of ranks to each of the objects.

In the light of the above interpretation of model or population characteristics, we may look again at those analyses of agreement suggested previously. Hollander and Sethuraman's (1978) statistic is sensitive to departures from $T_S\pi_1 = T_S\pi_2$. An extension to the case of $g$ groups was considered by Katz and McSweeney (1981): it amounts to treating disagreement as occurring if $T_S\pi_i \ne T_S\pi_j$ for some $1 \le i, j \le g$. Using other characteristics $T\pi$ instead of $T_S\pi$, other types of disagreement among populations of judges may be investigated. We develop this approach in the sequel, particularly for the pairwise concordance characteristic $T_K\pi$.

For other approaches to defining agreement, we refer to Li and Schucany’s 
(1975) discussion of various comparisons of populations of judges. The approach 
ascribed to Quade amounts to saying that the two populations agree if 


? Peres / 


where $\Delta = \Delta_S$ or $\Delta_K$. This approach seems to ignore the fact that the two populations could be equally concordant about different rankings. Linhart's (1960) definition is essentially equivalent to Quade's in that $\pi'\Delta_S\pi$ is directly related to the coefficient of concordance (viz. Kendall's $W$) for the population [see Alvo et al. (1982)].

Finally, we reiterate the point made in the introduction: the characteristic $t: \Omega \to \mathbb{R}^s$ which describes the information used in comparing rankings may itself be chosen directly by the researcher and need not be derived from a diversity argument based on a particular metric. In this case, however, there is a certain ambiguity in the definition of the corresponding $\Delta$ (distance matrix). We assume $T$ is chosen so that $Te = 0$ and then set


$\Delta(T) = \beta J - T'T.$


DEFINITION 3.1. $\Delta$ is called the minimal distance matrix corresponding to $T$ if it is nonnegative with at least one zero element (on the diagonal) and equals $\Delta(T)$ for some $\beta = \beta(T)$.


From the definition it is clear that

$\beta(T) = \max\{(T'T)_{ii},\ i = 1, \ldots, k\},$

and that if $T$ is defined in terms of a metric $\delta$ then the minimal distance matrix corresponds to the original matrix $\Delta = (\delta_{ij})$ (with zeroes on the diagonal).

In the sequel we will refer to the minimal distance matrix corresponding to $T$ as simply the distance matrix corresponding to $T$.


4. Inference for measures of agreement. We will refer to three essentially
different ways of making inferences about the level of agreement observed among 
a set of g groups of rankers. They involve procedures: 

A. based on the randomization distribution of a statistic; 

B. based on jackknifing or related methods; 

C. based on the asymptotic distributions of a statistic. 

For each approach the actual statistic to be used may be chosen either a priori 
or based on the corresponding analysis for the asymptotic (multivariate normal) 
experiment. Furthermore, we have the option of deriving versions of a given 
statistic type by choosing that model characteristic $T\pi$ in terms of which we wish
to assess the level of agreement. 

The data available, although probably not presented in this way (see below 
concerning computational aspects), may be summarized as follows. For $i =
1, \ldots, g$, the $n_i$ judges of group $i$ assign rankings and the resulting relative
frequency vector $f_i$ is defined by

(4.1)  $f_i(j) = n_i^{-1} \times$ (no. of times $\omega_j$ assigned in group $i$), $\quad j = 1, 2, \ldots, k = r!$.

The analysis of agreement may now proceed by defining statistics related to the
corresponding population quantities; that is, we replace $\pi_i$ by $f_i$ in the relevant
formulae. Given $T$ [and the corresponding distance matrix $\Delta$ and value $\beta = \beta(T)$


INTERGROUP DIVERSITY AND CONCORDANCE FOR RANKING DATA 701 


—see Definition 3.1], we begin by considering the following two statistics: 


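(4.2)  $\hat{a} = W/S$,
(4.3)  $\hat{\rho} = (\beta - S)/(\beta - W) = (\beta/S - 1)/(\beta/S - \hat{a})$,

where

$$W = \sum_{i=1}^{g} \lambda_i f_i' \Delta f_i = \beta - \sum_{i=1}^{g} \lambda_i \|T f_i\|^2,$$

$$S = f' \Delta f = \beta - \|T f\|^2,$$

$$f = \sum_{i=1}^{g} \lambda_i f_i,$$

and

(4.4)  $\lambda_i = n_i/N, \qquad N = n_1 + n_2 + \cdots + n_g$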


or, possibly, $\{\lambda_i\}$ is an a priori set of weights for the $g$ populations, independent
of the sample sizes. We assume in the sequel that the $\lambda_i$ are determined by (4.4).

At a descriptive level, both $\hat{a}$ and $\hat{\rho}$ measure the relative degree of intergroup
concordance, with values near one indicating stronger agreement between the groups.
The statistic $\hat{a}$ corresponds to the proportion of total diversity due to intragroup
differences (viz. Definition 2.2), whereas $\hat{\rho}$ is a generalized version of Kraemer’s
(1981) measure (viz. Lemma 3.2). These statistics to some degree answer the
question: What level of agreement do the populations reveal? While $\hat{a}$ refers to
the relative diversity (within/total), $\hat{\rho}$ relates to the relative similarity [(overall
similarity)/(within similarity)].
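
As a hedged illustration of (4.2)-(4.4), the following Python sketch computes $W$, $S$, $\hat{a}$ and $\hat{\rho}$ from raw rankings. It works with the vectors $t(\omega)$ directly, using the fact that $Tf_i$ is the average of $t$ over the rankings of group $i$. The helper names (`t_centered_ranks`, `agreement_stats`) and the default choice of centered ranks as the characteristic are assumptions made for the example only; $\beta$ is taken as the largest squared norm of $t(\omega)$, in line with Definition 3.1.

```python
from itertools import permutations


def t_centered_ranks(ranking):
    """Hypothetical characteristic: ranks centered at (r + 1) / 2."""
    r = len(ranking)
    return [x - (r + 1) / 2 for x in ranking]


def _mean_t(rankings, t):
    """Average of t(ranking) over a list of rankings, i.e. T f for that list."""
    vecs = [t(x) for x in rankings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def _sq_norm(v):
    return sum(c * c for c in v)


def agreement_stats(groups, t=t_centered_ranks):
    """Return (W, S, a_hat, rho_hat) for a list of groups of rankings."""
    r = len(groups[0][0])
    # beta(T) = max_i (T'T)_ii: the largest squared norm of a column of T.
    beta = max(_sq_norm(t(w)) for w in permutations(range(1, r + 1)))
    n_total = sum(len(g) for g in groups)
    # W = beta - sum_i lambda_i ||T f_i||^2 with lambda_i = n_i / N.
    w = beta - sum(len(g) / n_total * _sq_norm(_mean_t(g, t)) for g in groups)
    # S = beta - ||T f||^2, where f pools all judges.
    pooled = [x for g in groups for x in g]
    s = beta - _sq_norm(_mean_t(pooled, t))
    # Both ratios are undefined when a denominator vanishes (e.g. unanimity).
    return w, s, w / s, (beta - s) / (beta - w)
```

Enumerating all $r!$ rankings to obtain $\beta$ is only practical for small $r$; for a specific characteristic the maximum is usually available in closed form and could be passed in instead.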

To make inferences concerning the corresponding population quantities, we 
may employ procedures of type A or B. For testing, using A, with the null 
hypothesis of complete agreement 


(4.5)  $H_0$: $\pi_i = \pi$, $i = 1, \ldots, g$ (implies $a = \rho = 1$),


the randomization (permutation) distribution gives equal probability to each of
the $M = N!/\prod_{i=1}^{g}(n_i!)$ partitions of the $N$ judges into $g$ groups of sizes $n_1, \ldots, n_g$.
The significance level of the data is then the relative frequency of values of the
statistic ($\hat{a}$ or $\hat{\rho}$) that fall below the value observed for the actual partition.
Approximations for moderately large $N$ are also available. We refer the reader to
Mielke et al. (1981) for some details and references. Hollander and Sethuraman
(1978) derive their test statistic from the asymptotic approximation of the
randomization distribution of $T_S f$. However, we find it more natural to discuss
their approach with respect to procedures of type C. 

For confidence intervals for the population quantities $a$ or $\rho$, one may follow
Kraemer’s (1981) suggestion and compute the jackknife estimate of the parameter
as well as its standard error, viz. procedure B above. This procedure involves
recalculating the statistic, leaving out each ranking one at a time, and then using
the value from the $t$ distribution on the appropriate number of degrees of
freedom. Mosteller and Tukey (1977) give the general theory and Kraemer (1981)
discusses the application to the particular case of $\rho$ based on $T_S f$ (i.e., the average




ranks vector). The extension to $\hat{a}$ and to general $T$ is straightforward in principle.
Moreover, the probability value for the hypothesis $a = 1$ (or $\rho = 1$)
can be determined approximately using the same $t$ approximation.

The bootstrap is an approach related to the jackknife, which could also be 
applied to the analysis of $\hat{a}$ or $\hat{\rho}$. The bootstrap samples would each contain $N$
pairs $(i, \omega)$ randomly chosen with replacement from $\{(i, X_{ij}):\ j = 1, \ldots, n_i;\
i = 1, \ldots, g\}$ and then the standard Monte Carlo analysis would allow one to
estimate biases and standard errors as well as to construct confidence intervals.
(Here $X_{ij}$ is the ranking of judge $j$ in group $i$.) A succinct account of these ideas
appears in Efron (1982). 

The more classical statistical approach is to consider the asymptotic distribu- 
tions of the statistics involved and this approach is readily applicable here also. 
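We have, denoting the $p$-variate normal distribution by $N_p$:


THEOREM 4.1. For $n_i/N \to \lambda_i > 0$ as $N \to \infty$,

$$n_i^{1/2}\, T(f_i - \pi_i) \to_d U_i, \qquad i = 1, \ldots, g,$$

where $U_i \sim N_s(0, T\Sigma_i T')$, $U_1, \ldots, U_g$ are independent and

$$\Sigma_i = \Pi_i - \pi_i\pi_i', \qquad \Pi_i = \mathrm{diag}(\pi_i(1), \ldots, \pi_i(k)).$$


PROOF. A straightforward application of the central limit theorem for multinomial
vectors. □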


Based on the asymptotic formulation, a test for agreement based on the 
characteristic $T\pi$, i.e., of

$$H_0(T):\ T\pi_1 = T\pi_2 = \cdots = T\pi_g,$$

amounts to a multivariate analysis of variance with nonhomogeneous covariance
matrices. If we regard the null hypothesis as

$$H_0:\ \pi_1 = \pi_2 = \cdots = \pi_g,$$

then under $H_0$ the covariance matrices are homogeneous and classical MANOVA
may be applied.

The approach proposed by Katz and McSweeney (1981) is designed for testing
$H_0(T)$ and allows for the nonhomogeneity of variances. Although they only
consider $T = T_S$, there is again no major difficulty in applying their approach to
other T matrices. 

We will concentrate on the situation of two groups of judges (g = 2). 


THEOREM 4.2. For $g = 2$ and under $H_0(T)$, the conditions of Theorem 4.1
imply

$$\sqrt{N}\, T(f_1 - f_2) \Rightarrow N_s(0, T\Sigma T'),$$

where

(4.6)  $\Sigma = \lambda_1^{-1}\Sigma_1 + \lambda_2^{-1}\Sigma_2$.


PROOF. Straightforward. □




COROLLARY. Suppose $\hat\Sigma$ is a consistent estimate of $\Sigma$ and that $\hat{D}$ is the
Moore-Penrose inverse of $T\hat\Sigma T'$. Then, under $H_0(T)$,

$$\chi_N^2 = N(f_1 - f_2)'T'\hat{D}T(f_1 - f_2) \to_d \chi_\nu^2,$$

where $\nu = \mathrm{rank}(T\Sigma T')$.


PROOF. Since $\hat{D}$ is consistent for $D$, where $D$ is the Moore-Penrose (generalized)
inverse of $T\Sigma T'$, the result follows from standard multivariate normal
theory and the continuity theorem of weak convergence. □


One way to circumvent the need to use generalized inverses is by choosing $T$ so
that $T\hat\Sigma T'$ is of full rank. Thus the $((r - 1) \times k)$ matrix $T$ only uses the ranks
of the first $(r - 1)$ objects and thus avoids the obvious singularity incurred by
using $T_S$. Of course singularities may also derive from the particular form of $\Sigma$.
For the case of $T = T_K$ [of dimension $\binom{r}{2} \times k$], there is no a priori singularity in
the matrix $T\Sigma T'$.

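It is now important to decide on how to estimate $\Sigma$ of (4.6). We write

$$\hat\Sigma_i = n_i[F_i - f_i f_i']/(n_i - 1), \qquad i = 1, \ldots, g,$$

where

$$F_i = \mathrm{diag}(f_i(1), \ldots, f_i(k)).$$

There are basically three different estimates of $\Sigma$, depending to some extent on
which null hypothesis is entertained:

$$\hat\Sigma_S = N(n_1^{-1}\hat\Sigma_1 + n_2^{-1}\hat\Sigma_2),$$

$$\hat\Sigma_P = [N^2/(n_1 n_2)]\,((n_1 - 1)\hat\Sigma_1 + (n_2 - 1)\hat\Sigma_2)\,[1/(N - 2)],$$

$$\hat\Sigma_C = \frac{N - 2}{N - 1}\,\hat\Sigma_P + \frac{N}{N - 1}\,(f_1 - f_2)(f_1 - f_2)'.$$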


The estimate $\hat\Sigma_S$ (separate) is appropriate when $H_0(T)$ is considered to be the
null hypothesis since in this situation we may not assume that the variance
matrices (dispersions) are equal. The estimate $\hat\Sigma_P$ (pooled) is obtained by pooling
the estimates $\hat\Sigma_1$ and $\hat\Sigma_2$ and is appropriate under $H_0$. Hollander and Sethuraman
(1978) actually use $\hat\Sigma_C$ (combined sample), which is the estimate of dispersion
based on the combined sample of size $N$. In the MANOVA context, $(N - 1)\hat\Sigma_C$ is
the total dispersion whereas $(N - 2)\hat\Sigma_P$ is the within groups dispersion; the
latter is more commonly used to estimate the common dispersion.

Returning to testing for agreement, we recall that the relevant dispersion is
$T\Sigma T'$ which is of dimension $s$, typically much less than $k = r!$. It is this lower
dimensionality that makes the inversion of $T\hat\Sigma T'$ feasible and also makes the
(asymptotic) normality a better approximation [see Remark 5(d) of Alvo et al.
(1982)]. Based on the corollary, we therefore have an asymptotically $\chi^2$ statistic
for testing the hypothesis of complete agreement [$H_0(T)$ or $H_0$]. Alternatively,
we may use the (approximate) $F$ statistic appropriate to the (asymptotic)
multivariate normal situation [see Rao (1965, Section 8d.3)],

$$F_N = \{(N - \nu - 1)/[(N - 2)\nu]\}\,\chi_N^2 \approx F_{\nu,\,N - \nu - 1}$$







for $\chi_N^2$ based on the (generalized) inverse of $T\hat\Sigma_P T'$ (appropriate for the case of
testing $H_0$). In this way we take into account the sampling error of estimating $\Sigma$.
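
The conversion from the chi-square statistic to the approximate $F$ statistic is simple arithmetic, and a small Python sketch makes it concrete. The function name `f_from_chi2` is hypothetical. As a check against Table 1, the pooled Spearman entry reports a chi-square value of 28.5 on 2 degrees of freedom with $N = 27$, alongside an $F$ value of 13.7 on (2, 24) degrees of freedom.

```python
def f_from_chi2(chi2_stat, n_total, nu):
    """Scale the chi-square statistic to the approximate F statistic.

    Returns the F value together with its (numerator, denominator)
    degrees of freedom, (nu, N - nu - 1).
    """
    factor = (n_total - nu - 1) / ((n_total - 2) * nu)
    return factor * chi2_stat, (nu, n_total - nu - 1)


# Pooled Spearman entry of Table 1: chi-square 28.5 on 2 df, N = 27.
f_value, f_df = f_from_chi2(28.5, 27, 2)
```

The scaling applies to the pooled estimate; for the separate estimate the tables use a more conservative choice of degrees of freedom, as described in their footnotes.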

The matrix $T$ has columns $t_1, \ldots, t_k$ which in turn define a mapping $t$ from $\Omega$
to $R^s$ as follows: for $\omega \in \Omega$

$$t(\omega) = t_j \quad \text{if } \omega = \omega_j.$$


In Section 3 we have seen how to construct (w) for T= Te, Tg or Tp. The 
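matrix $T\hat\Sigma_i T'$ is simply the estimate of the $(s \times s)$ covariance matrix based on
the vectors $\{t(X_{ij}):\ j = 1, \ldots, n_i\}$;
in other words,

(4.7)  $T\hat\Sigma_i T' = (n_i - 1)^{-1} \sum_{j=1}^{n_i} (t(X_{ij}) - \bar{t}_i)(t(X_{ij}) - \bar{t}_i)'$,

where

(4.8)  $\bar{t}_i = n_i^{-1} \sum_{j=1}^{n_i} t(X_{ij})$.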


These equations illustrate how the computations may be done in $R^s$, with
matrices of dimension $(s \times s)$, there being no need to handle the $(k \times k)$
matrices $\hat\Sigma_i$. Even for $r = 10$ and $T = T_K$, we have $s = \binom{10}{2} = 45$, leading to
matrix inversions well within the capabilities of today’s mini-computers.
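
A minimal Python sketch of (4.7) and (4.8) follows. It never forms a $k \times k$ matrix: each ranking is mapped to its vector $t(X_{ij})$ and the sample mean and covariance are computed in $R^s$. The function names are hypothetical, and the pairwise sign form used for `t_kendall` is an assumption about the Kendall characteristic of Section 3, which is not reproduced here.

```python
def t_kendall(ranking):
    """Assumed pairwise characteristic: sign(rank_b - rank_a) for a < b."""
    r = len(ranking)
    return [(ranking[b] > ranking[a]) - (ranking[b] < ranking[a])
            for a in range(r) for b in range(a + 1, r)]


def t_mean_and_cov(rankings, t=t_kendall):
    """Mean vector (4.8) and s x s covariance matrix (4.7) of t(X_ij).

    Needs at least two rankings, since the divisor is n_i - 1.
    """
    vecs = [t(x) for x in rankings]
    n, s = len(vecs), len(vecs[0])
    mean = [sum(col) / n for col in zip(*vecs)]
    cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vecs) / (n - 1)
            for b in range(s)]
           for a in range(s)]
    return mean, cov


# Three judges ranking r = 3 objects, so s = 3 pairwise comparisons.
mean, cov = t_mean_and_cov([(1, 2, 3), (2, 1, 3), (1, 2, 3)])
```

For $r$ objects the vectors have length $r(r-1)/2$, which is the dimension $s$ quoted in the text for the Kendall case.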

In terms of the mappings t we may also deal with the problem of ties. The 
idea is to extend the definition of $t$ to the domain of possibly tied rankings in a
linear way. For example, for the tied ranking

$$\zeta = (1, 4, 2.5, 2.5, 5)'$$

we may consider it as the average of

$$\eta = (1, 4, 2, 3, 5)', \qquad \mu = (1, 4, 3, 2, 5)'$$

and define

(4.9)  $t(\zeta) = \tfrac{1}{2}(t(\eta) + t(\mu))$.

For $t = t_S$ we simply have

$$t_S(\zeta) = (-2, 1, -0.5, -0.5, 2)',$$

whereas for $t = t_K$

$$t_K(\zeta) = (1, 1, 1, 1, -1, -1, 1, 0, 1, 1)',$$

with the zero (0) deriving from (4.9) and coinciding with the natural extension of
(3.10).
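
The tie-handling rule (4.9) can be sketched in Python as an average of $t$ over the untied rankings that a tied ranking represents. The definitions of $t_S$ and $t_K$ belong to Section 3 and are not reproduced here, so the forms below (centered ranks, and signs of pairwise rank differences) are assumptions, chosen because they give the two vectors quoted in the text for this example. All function names are hypothetical.

```python
def t_spearman(ranking):
    """Assumed Spearman-type characteristic: ranks centered at (r + 1) / 2."""
    r = len(ranking)
    return [x - (r + 1) / 2 for x in ranking]


def t_kendall(ranking):
    """Assumed Kendall-type characteristic: sign(rank_b - rank_a), a < b."""
    r = len(ranking)
    return [(ranking[b] > ranking[a]) - (ranking[b] < ranking[a])
            for a in range(r) for b in range(a + 1, r)]


def t_tied(t, untied_rankings):
    """Linear extension (4.9): average t over the compatible untied rankings."""
    vecs = [t(x) for x in untied_rankings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


eta = (1, 4, 2, 3, 5)
mu = (1, 4, 3, 2, 5)
spearman_vec = t_tied(t_spearman, [eta, mu])
kendall_vec = t_tied(t_kendall, [eta, mu])
```

With more than two tied objects the same function applies, provided the list contains every untied ranking compatible with the tie.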

The fact that the sample space and hence the appropriate $\pi$ vector is much
enlarged by allowing tied rankings has no impact on the calculations outlined in
(4.7) and (4.8) once the $t$ mapping is appropriately extended. In this way we also
induce an extension of the distance $\delta$ and the matrix $\Delta$ to the new sample space.
It will, however, follow that the diagonal elements of the extended $\Delta$ matrix will
be positive corresponding to rankings with ties; that is, a ranking with ties is not
at “distance” zero from itself! Although slightly disturbing at first, it is possible
and thought-provoking to justify such a phenomenon in terms of comparing two 
judges who give identical rankings with or without ties. We leave the reader to 
ponder this aspect. 

In the following section, we illustrate the application of the above ideas to 
various data sets. We focus attention on the analysis for $T = T_S$ and $T = T_K$.


5. Examples. 


5.1 Sutton’s data. Hollander and Sethuraman (1978) present data of C. 
Sutton on leisure time preferences for one group of 14 white females and one 
group of 13 black females. The analysis is summarized in Table 1. 

The apportionment of diversity shows that 31% for the Spearman metric (and 
27% for the Kendall metric) of the diversity is between groups. The percentages 
represent a sizeable proportion of the total diversity in the combined group of 27 
women. The coefficient of intergroup concordance $\hat\rho$ indicates very high relative
concordance of 0.97 between the groups for the Spearman case but only 0.64 for 
the Kendall case. In other words, intergroup concordance is not so high if one 
uses the more sensitive Kendall distance. 

The significance tests (approximate) all lead to significant values, with the F 
statistics having “p values” less than 0.2%. We note that the analysis based on 
the combined sample estimate leads to more conservative values. This effect can 
be explained by the fact that the differences between the groups are blurred since 
the estimated covariance matrix is larger (in the ordering of positive definiteness). 
If one is testing $H_0$: $\pi_1 = \pi_2$ then the “pooled” estimate is preferable.


TABLE 1
Analysis of concordance: Sutton’s data, black/white females

                        Spearman                              Kendall

Apportionment
  Within                0.88 (0.69 = $\hat{a}$)               1.51 (0.73 = $\hat{a}$)
  Between               0.41                                  0.54
  Total                 1.29                                  2.05
Kraemer’s $\hat\rho$    0.97                                  0.64

Testing $H_0$, $H_0(T)$
                        $\chi^2$(df)    $F^1$(df)             $\chi^2$(df)    $F^1$(df)
  Separate$^2$          28.0(2)         12.8(2, 11)$^3$       28.1(3)         7.8(3, 10)$^3$
  Pooled$^2$            28.5(2)         13.7(2, 24)           28.5(3)         8.7(3, 23)
  Combined$^2$          13.8(2)         ×                     13.9(3)         ×

2 groups; $n_1 = 14$, $n_2 = 13$; $N = 27$; $r = 3$.
$^1$The $F$ statistic is Hotelling’s $T^2$ statistic (see Section 4).
$^2$See the discussion of estimating $\Sigma$ in Section 4.
$^3$Approximate (conservative) $F$ approximation using $\min(n_1, n_2) - 1$ for degrees of
freedom of estimate $\hat\Sigma_S$.




TABLE 2
Analysis of concordance: Latrobe Valley data, male/female residents

                        Spearman                              Kendall

Apportionment
  Within                28.41 (0.98 = $\hat{a}$)              20.81 (0.99 = $\hat{a}$)
  Between               0.46                                  0.29
  Total                 28.86                                 21.10
Kraemer’s $\hat\rho$    0.997                                 0.96

Testing $H_0$, $H_0(T)$
                        $\chi^2$(df)    $F^1$(df)             $\chi^2$(df)    $F^1$(df)
  Separate$^2$          7.5(7)          0.9(7, 40)$^3$        40.7(28)        0.6(28, 19)$^3$
  Pooled$^2$            7.5(7)          1.0(7, 87)            40.6(28)        1.0(28, 66)
  Combined$^2$          7.0(7)          ×                     28.6(28)        ×

2 groups; $n_1 = 47$, $n_2 = 48$; $N = 95$; $r = 8$.
$^1$As for Table 1.
$^2$As for Table 1.
$^3$As for Table 1.


In this example there is some contradiction between the diversity analysis and 
the intergroup concordance result (at least for the Spearman case). The former 
indicates group differences whereas the latter indicates high intergroup agreement.
The same is not true for the Kendall metric. An explanation is that
Kraemer’s $\hat\rho$ depends on the maximum possible diversity ($\beta$) and if this is far
from being attained within or in total then $\hat\rho$ will be large. That $\hat\rho$ seems to be so
sensitive to the metric used is a disadvantage compared to $\hat{a}$. Alternatively, one
may claim that the intergroup concordance based on Kendall’s metric is more
meaningful than that originally proposed by Kraemer. This claim is borne out by
other examples which were investigated and in which the between group
differences are significant.


5.2 Latrobe Valley data. Residents in the Latrobe Valley of Victoria, 
Australia were asked to rank eight sectors in order of the degree that they would 
be affected by proposed industrial developments. Of the 95 respondents, 47 were 
male and 48 were female and the researcher wanted to know if there was a 
difference due to sex. 

The results of the concordance analysis (Table 2) show quite unequivocally 
that there is no difference in the rankings of the eight sectors between males and 
females. The ability to cross-check such conclusions by using two (or more) 
metrics is one of the main contributions of the proposed methodology. 


6. Conclusions. We have approached the comparison of groups of rankers 
from the point of view of analysing diversity based on various metrics for ranks. 
We have shown how this approach is related to others based on measuring 
intergroup concordance as well as interpreting the analysis as the comparison of 
population (or model) characteristics. 

A class of descriptive and test statistics has been proposed to allow the
researcher to quantify the proportion of diversity ascribed to differences between
groups as well as to test hypotheses of equality of populations or of their
characteristics.

Experience with examples indicates that the new types of statistics proposed 
as well as the extension of intergroup concordance originally proposed by Kraemer 
(1981) provide useful extra information for comparing groups of rankers. 


Acknowledgment. The data is available from the first author. We acknowl- 
edge the researchers of the Division of Building Research at CSIRO Victoria, 
Australia, who collected and made the data available to us. 


REFERENCES 


ALVO, M., CABILIO, P. and FEIGIN, P. D. (1982). Asymptotic theory for measures of concordance with
special reference to average Kendall tau. Ann. Statist. 10 1269-1276.

DIACONIS, P. (1985). Group Theory in Statistics. IMS Lecture Notes-Monograph Series. Forthcoming.

DIACONIS, P. and GRAHAM, R. L. (1977). Spearman’s footrule as a measure of disarray. J. Roy.
Statist. Soc. Ser. B 39 262-268.

EFRON, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia.

FEIGIN, P. D. and COHEN, A. (1978). On a model for concordance between judges. J. Roy. Statist.
Soc. Ser. B 40 203-213.

HOLLANDER, M. and SETHURAMAN, J. (1978). Testing for agreement between two groups of judges.
Biometrika 65 403-411.

KATZ, B. M. and MCSWEENEY, M. (1981). Some tests for ranked data in repeated measures
multi-group designs. Amer. Statist. Assoc. Proc. Soc. Statist. Sec. 476-481.

KRAEMER, H. C. (1981). Intergroup concordance: definition and estimation. Biometrika 68 641-644.

LI, L. and SCHUCANY, W. R. (1975). Some properties of a test for concordance of two groups of
rankings. Biometrika 62 417-423.

LINHART, H. (1960). Approximate test for m rankings. Biometrika 47 476-480.

MALLOWS, C. L. (1957). Non-null ranking models I. Biometrika 44 114-130.

MIELKE, P. W., BERRY, K. J., BROCKWELL, P. J. and WILLIAMS, J. S. (1981). A class of nonparametric
tests based on multiresponse permutation procedures. Biometrika 68 720-724.

MOSTELLER, F. and TUKEY, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading,
Mass.

PATIL, G. P. and TAILLIE, C. (1982). Diversity as a concept and its measurement. J. Amer. Statist.
Assoc. 77 548-561.

PETTITT, A. N. (1982). Parametric tests for agreement amongst groups of judges. Biometrika 69
365-375.

RAO, C. R. (1982a). Diversity and dissimilarity coefficients: A unified approach. J. Theoret. Pop.
Biol. 21 24-43.

RAO, C. R. (1982b). Gini-Simpson index of diversity: a characterization, generalization and
applications. Utilitas Math. 21B 273-282.

SCHUCANY, W. R. and FRAWLEY, W. H. (1973). A rank test for two group concordance.
Psychometrika 38 249-258.


FACULTY OF INDUSTRIAL ENGINEERING
AND MANAGEMENT
TECHNION
HAIFA 32000
ISRAEL

DEPARTMENT OF MATHEMATICS
UNIVERSITY OF OTTAWA
OTTAWA, CANADA


The Annals of Statistics 
1986, Vol. 14, No. 2, 708-724


TESTING FOR NORMALITY IN ARBITRARY DIMENSION 


By SÁNDOR CSÖRGŐ
Szeged University and University of California, San Diego 


The univariate weak convergence theorem of Murota and Takeuchi
(1981) is extended for the Mahalanobis transform of the d-variate empirical
characteristic function, $d \ge 1$. Then a maximal deviation statistic is proposed
for testing the composite hypothesis of d-variate normality. Fernique’s inequality
is used in conjunction with a combination of analytic, numerical
analytic, and computer techniques to derive exact upper bounds for the
asymptotic percentage points of the statistic. The resulting conservative large
sample test is shown to be consistent against every alternative with components
having a finite variance. (If $d = 1$ it is consistent against every alternative.)
Monte Carlo experiments and the performance of the test on some
well-known data sets are also discussed.


1. Introduction. Besides the permanent interest in testing univariate normality,
recent years have witnessed a large increase of interest in the corresponding
equally important but more intricate problem of testing for multivariate
normality. The work of Weiss (1958), Anderson (1966), Cox (1968), Healy (1968),
Wagle (1968), Wilk and Gnanadesikan (1968), Day (1969), Mardia (1970, 1974, 
1975), Andrews, Gnanadesikan, and Warner (1971, 1972, 1973), Malkovich (1971), 
Aitkin (1972), Gnanadesikan and Kettenring (1972), Kessel and Fukunaga (1972), 
Dahiya and Gurland (1973), Malkovich and Afifi (1973), Mardia and Zemroch 
(1975), Giorgi and Fattorini (1976), and Hensler, Mehrotra, and Michalek (1977)
has been discussed in detail by Gnanadesikan (1977, pages 161-195), Cox and 
Small (1978), and Mardia (1980). The proposals by Sarkadi and Tusnady (1977), 
Small (1978, 1980), De Wet, Venter, and van Wyk (1979), Pettitt (1979), Rincon-
Gallardo, Quesenberry, and O'Reilly (1979), Hawkins (1981), Moore and 
Stubblebine (1981), Yang (1981), and Koziol (1982, 1983), not covered in the three 
surveys, are either continuations of earlier work or may be more or less fitted into 
one of the classification sections in Mardia (1980). The goodness-of-fit tests 
recently introduced by Bickel and Breiman (1983) [see also Schilling (1983a, b)] 
for a simple multidimensional hypothesis do not seem to lend themselves easily 
for adaptation to the composite case. 

The approach of the present paper is based on the asymptotic behaviour of the 
multivariate “studentised” empirical characteristic function and is a multivariate
extension of the recent approach of Murota and Takeuchi (1981) for testing
univariate normality. Thus the basic weak convergence theorem in Section 2
extends the corresponding result, Theorem 6, of Murota and Takeuchi (1981) to
arbitrary dimension. The distributions of the most relevant functionals, such as 


Received September 1984; revised April 1985. 

AMS 1980 subject classifications. Primary 62H15; secondary 62F03, 62F05. 

Key words and phrases. Empirical characteristic function, Mahalanobis transform, univariate and 
multivariate normality, weak convergence, maximal deviation, Fernique’s and Borell’s bounds on the 
absolute supremum of a Gaussian process. 




TESTING FOR NORMALITY 709 


the absolute supremum and the squared integral, of the limiting parameter-free 
Gaussian process are not known even in the univariate case. Therefore, Murota 
and Takeuchi (1981) have considered the simplest possible projection functional 
as a test statistic for testing univariate normality. [See also Murota (1981); 
related univariate tests are in Hall and Welsh (1983) and Welsh (1985).] Although
we briefly mention multivariate extensions of the projection statistics in Section
3, our primary aim is to give a tight bound on the tail of the distribution of the
absolute supremum of the limiting multivariate Gaussian process. This is achieved
by applying to all possible limits a powerful inequality of Fernique (1975) in
Section 4. The resulting formal conservative large sample Kolmogorov-type test
is new even in the univariate case. The consistency of the test is also discussed in
Section 4. Approximate computing formulae for our maximal deviation statistic
are given in Section 5, a limited simulation study under the null hypothesis is
discussed in Section 6, while in Section 7 the performance of the test is illustrated 
on the well-known Norton’s bank data and Fisher’s Iris setosa data. A related 
Cramér-von Mises type statistic is mentioned briefly in Section 8.


2. The basic weak convergence theorem. Let $d \ge 1$ be a fixed integer and
let $X(1), \ldots, X(n)$ be independent $d$-dimensional random column vectors identically
distributed as $X' = (X_1, \ldots, X_d)$, where the prime denotes transpose. Let

$$C_n(t) = U_n(t) + iV_n(t) = n^{-1} \sum_{j=1}^{n} \exp(i\langle t, X(j)\rangle), \qquad t' = (t_1, \ldots, t_d) \in R^d,$$

be the sample characteristic function, where $\langle t, s\rangle = \sum_{k=1}^{d} t_k s_k$ with $s' =
(s_1, \ldots, s_d)$ stands for the inner product in $R^d$, and let $S_n = (s_{km}(n))$ be the
sample covariance matrix

$$s_{km}(n) = n^{-1} \sum_{j=1}^{n} (X_k(j) - \bar{X}_k(n))(X_m(j) - \bar{X}_m(n)), \qquad k, m = 1, \ldots, d,$$

where $\bar{X}(n)' = (\bar{X}_1(n), \ldots, \bar{X}_d(n)) = n^{-1} \sum_{j=1}^{n} X(j)'$ is the sample mean vector.
Assuming that the underlying distribution function of $X$ is continuous, we may
almost surely consider the unique symmetric positive definite square root $S_n^{-1/2}$
of the inverse $S_n^{-1}$ of $S_n$. The studentised empirical characteristic function, or,
rather, the Mahalanobis transform of $C_n(t)$ is

(2.1)
$$\begin{aligned}
C_n(S_n^{-1/2} t) &= n^{-1} \sum_{j=1}^{n} \exp(i\langle S_n^{-1/2} t, X(j)\rangle) \\
&= n^{-1} \sum_{j=1}^{n} \exp(i\langle t, S_n^{-1/2} X(j)\rangle) \\
&= n^{-1} \sum_{j=1}^{n} \exp(i\langle t, X(j)\rangle)\exp(i\langle t, (S_n^{-1/2} - I)X(j)\rangle),
\end{aligned}$$

where $I$ is the $d \times d$ identity matrix. Since its squared modulus $|C_n(S_n^{-1/2}t)|^2$ is


710 S. CSORGO 


invariant under all nonsingular affine transformations of the sample 
X(1),..., X(n) for any vector t, there will be no loss of generality while proceed- 
ing toward the main goal of this section, Theorem 2.2, if in the preliminary 
Theorem 2.1 below, we assume that $\mu = EX = 0 \in R^d$ and $\Sigma = I$, or, in fact,
that the components of X are independent. 

Let $T$ be an arbitrary positive number and let $\mathcal{C} = \mathcal{C}([-T, T]^d)$ and $\mathcal{C}^2 =
\mathcal{C}^2([-T, T]^d)$ be the separable Banach spaces of the $d$-variate continuous real
and complex valued functions, respectively, defined on the cube $[-T, T]^d$ and
endowed with the respective supremum norms. Let $C(t) = U(t) + iV(t) =
E\exp(i\langle t, X\rangle)$, $t \in R^d$, be the population characteristic function. Our basic
stochastic process

$$Z_n(t) = n^{1/2}\{|C_n(S_n^{-1/2}t)|^2 - |C(t)|^2\}$$

is a random element of $\mathcal{C}$ for each $n = 1, 2, \ldots$. Let us introduce the vector of
partial derivatives of $C$,

$$(\nabla C(t))' = \left(\frac{\partial C(t)}{\partial t_1}, \ldots, \frac{\partial C(t)}{\partial t_d}\right),$$

and the corresponding $d \times d$ Laplacian matrix $\nabla^2 C(t)$ with elements

$$\frac{\partial^2 C(t)}{\partial t_k\,\partial t_m}, \qquad k, m = 1, \ldots, d.$$

Assuming that the vector $\mu^{(4)\prime} = (\mu_1^{(4)}, \ldots, \mu_d^{(4)}) = (EX_1^4, \ldots, EX_d^4)$ is finite, consider
the $d$-variate complex Gaussian process $Y(t)$ satisfying $\overline{Y(t)} = \operatorname{Re} Y(t) -
i \operatorname{Im} Y(t) = Y(-t)$ and $EY(t) = 0$ for each $t$ and, for each $s, t \in R^d$,


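(2.2)
$$\begin{aligned}
\rho(s, t) &= EY(s)Y(t) \\
&= C(s + t) - C(s)C(t) \\
&\quad + \tfrac{1}{2}\big\{s'(\nabla^2 C(t))\nabla C(s) + t'(\nabla^2 C(s))\nabla C(t)
 + C(s)\langle t, \nabla C(t)\rangle + C(t)\langle s, \nabla C(s)\rangle\big\} \\
&\quad + \tfrac{1}{4}\Big\{\sum_{k=1}^{d}(\mu_k^{(4)} - 1)\,s_k t_k\,\frac{\partial C(s)}{\partial s_k}\frac{\partial C(t)}{\partial t_k}
 + \sum_{k,m=1,\,k\neq m}^{d}\Big[s_k t_k\,\frac{\partial C(s)}{\partial s_m}\frac{\partial C(t)}{\partial t_m}
 + s_k t_m\,\frac{\partial C(s)}{\partial s_m}\frac{\partial C(t)}{\partial t_k}\Big]\Big\}.
\end{aligned}$$

That such a process exists, that is, that $\rho$ indeed is a covariance function, may be
seen by considering the random function

(2.3)
$$R(t) = R(t; X) = e^{i\langle t, X\rangle} - C(t)
 - \frac{1}{2}\sum_{k=1}^{d}(X_k^2 - 1)\,t_k\,\frac{\partial C(t)}{\partial t_k}
 - \frac{1}{2}\sum_{k,m=1,\,k\neq m}^{d} X_k X_m\,t_k\,\frac{\partial C(t)}{\partial t_m}$$

in $\mathcal{C}^2$, where the components of $X' = (X_1, \ldots, X_d)$ are independent with $EX_1 =
\cdots = EX_d = 0$ and $EX_1^2 = \cdots = EX_d^2 = 1$. Then a straightforward but somewhat
lengthy computation shows that $\rho(s, t) = ER(s)R(t)$. Moreover, since $\mu^{(4)}$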
is finite, $C(t)$ has uniformly continuous partial derivatives of the fourth order
over the whole space $R^d$. Therefore a one-term $d$-variate Taylor expansion easily
gives

(2.4)  $E|Y(s) - Y(t)|^2 = E|R(s) - R(t)|^2 \le K|s - t|^2$,

with some constant $K > 0$, and this is more than enough to imply the sample
continuity of the complex Gaussian process $Y$ (Fernique, 1975, page 48). Hence $Y$
may be considered as a random element of $\mathcal{C}^2$, and consequently the $d$-variate
real Gaussian process

(2.5)  $Z(t) = 2\{U(t)\operatorname{Re} Y(t) + V(t)\operatorname{Im} Y(t)\} = 2\operatorname{Re}\{C(-t)Y(t)\}$,

with mean zero and covariance function

(2.6)  $\sigma(s, t) = EZ(s)Z(t) = 2\operatorname{Re}\{C(-s)C(-t)\rho(s, t) + C(s)C(-t)\rho(-s, t)\}$,

is a random element of $\mathcal{C}$. Note that $Z(t) = Z(-t)$.


THEOREM 2.1. If the components of $X' = (X_1, \ldots, X_d)$ are independent with
$EX_1 = \cdots = EX_d = 0$, $EX_1^2 = \cdots = EX_d^2 = 1$, and the vector $\mu^{(4)}$ finite, then
the sequence $\{Z_n(\cdot)\}$ converges weakly, as $n \to \infty$, in $\mathcal{C}([-T, T]^d)$ to the
Gaussian process $Z(\cdot)$.


The proof of Theorem 2.1 is in the Appendix.
Let $N_d(\mu, \Sigma)$ denote the $d$-dimensional normal distribution with mean vector
$\mu$ and covariance matrix $\Sigma$, where $d \ge 1$. Our aim is to test the composite
hypothesis

$H_0$: The law of $X$ is $N_d(\mu, \Sigma)$ with some $\mu$ and some nonsingular $\Sigma$.


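Note that when $C(t)$ is real then from (2.2) and (2.6) we obtain

(2.7)
$$\begin{aligned}
\sigma(s, t) = 2C(s)C(t)\Big[\, & C(s + t) + C(s - t) - 2C(s)C(t) \\
& + \big\{s'(\nabla^2 C(t))\nabla C(s) + t'(\nabla^2 C(s))\nabla C(t)
 + C(s)\langle t, \nabla C(t)\rangle + C(t)\langle s, \nabla C(s)\rangle\big\} \\
& + \tfrac{1}{2}\Big\{\sum_{k=1}^{d}(\mu_k^{(4)} - 1)\,s_k t_k\,\frac{\partial C(s)}{\partial s_k}\frac{\partial C(t)}{\partial t_k}
 + \sum_{k,m=1,\,k\neq m}^{d}\Big[s_k t_k\,\frac{\partial C(s)}{\partial s_m}\frac{\partial C(t)}{\partial t_m}
 + s_k t_m\,\frac{\partial C(s)}{\partial s_m}\frac{\partial C(t)}{\partial t_k}\Big]\Big\}\Big].
\end{aligned}$$


THEOREM 2.2. If $H_0$ holds then the sequence of stochastic processes

$$Z_n(t) = n^{1/2}\{|C_n(S_n^{-1/2}t)|^2 - e^{-|t|^2}\}$$

converges weakly in $\mathcal{C}([-T, T]^d)$ to the Gaussian process $Z(t)$ satisfying $Z(t) =
Z(-t)$, $EZ(t) = 0$ and

$$\sigma(s, t) = EZ(s)Z(t) = 4e^{-(|s|^2 + |t|^2)}\big\{\cosh(\langle s, t\rangle) - 1 - \tfrac{1}{2}\langle s, t\rangle^2\big\}.$$


PROOF. Since the limiting process is the same for any $\mu$ and $\Sigma$ under $H_0$, the
theorem follows by substituting $C(t) = \exp(-\tfrac{1}{2}\langle t, t\rangle)$ and $\mu_k^{(4)} = 3$ into (2.7).
Then a lengthy computation yields the indicated formula.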


Note that $\cosh(x) - 1 - (x^2/2) = O(x^4)$ as $x \to 0$, and the process $Z(\cdot)$ in
Theorem 2.2 has the interesting property that $Z(s)$ and $Z(t)$ are uncorrelated
and hence independent for any vectors $s$ and $t$ that are orthogonal to one
another. For nonzero and nonorthogonal $s$ and $t$, $\sigma(s, t) > 0$.
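
The covariance function of Theorem 2.2 is easy to evaluate, and a short Python sketch illustrates the two properties just noted: it vanishes for orthogonal arguments and is positive otherwise. The function name `sigma_cov` is hypothetical; vectors are passed as plain sequences of equal length.

```python
import math


def sigma_cov(s, t):
    """Limiting covariance of Z(s) and Z(t) under the normal hypothesis.

    Implements 4 exp(-(|s|^2 + |t|^2)) {cosh(<s,t>) - 1 - <s,t>^2 / 2}.
    """
    dot = sum(a * b for a, b in zip(s, t))
    norm_s = sum(a * a for a in s)
    norm_t = sum(b * b for b in t)
    return 4.0 * math.exp(-(norm_s + norm_t)) * (math.cosh(dot) - 1.0 - dot * dot / 2.0)


orthogonal = sigma_cov((1.0, 0.0), (0.0, 2.0))     # inner product is zero
oblique = sigma_cov((1.0, 1.0), (1.0, 0.5))        # inner product is 1.5
```

For very small inner products the bracket suffers from cancellation in floating point, so a series expansion of $\cosh$ would be the safer choice near the origin.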


3. Simple projection statistics. The simplest possible such statistic is
obtained if we consider the nonzero vectors $t_1, \ldots, t_L$ somewhere in the vicinity
of the origin, such that the $L \times L$ matrix $R = (\sigma(t_k, t_m))$ be nonsingular and
form the quadratic form $Q_n = Q_n(t_1, \ldots, t_L) = z_n' R^{-1} z_n$, where $z_n' =
(Z_n(t_1), \ldots, Z_n(t_L))$ with $Z_n$ and $\sigma$ of Theorem 2.2. Then under $H_0$ the asymptotic
distribution of $Q_n$ is the chi-square distribution with $L$ degrees of freedom.
There is no theoretically justified ground, however, upon which the number $L$
and the location of the vectors $t_1, \ldots, t_L$ could reasonably be chosen.

Another, perhaps more appealing $d$-dimensional extension of the Murota and
Takeuchi (1981) statistic is based on the observation at the end of the preceding
section. Let $t_1, \ldots, t_d$ be nonzero, pairwise orthogonal vectors from $R^d$ and set
$N_n^{(d)} = \max(|Z_n(t_1)|, \ldots, |Z_n(t_d)|)$. Then under $H_0$,

$$\lim_{n\to\infty} \Pr\{N_n^{(d)} < x\} = \prod_{k=1}^{d}\Big\{2\Phi\Big(\frac{x}{\sigma(|t_k|)}\Big) - 1\Big\}, \qquad 0 < x < \infty,$$

where $\Phi$ is the standard normal distribution function and $\sigma^2(|t|) = \sigma(t, t) =
4\exp\{-2|t|^2\}\{\cosh(|t|^2) - 1 - \tfrac{1}{2}|t|^4\}$ depends only on the length of $t$. The
standard-deviation function $\sigma(|t|)$ has a unique global maximum on $[0, \infty)$,

(3.1)  $\sigma(|t_0|) = 0.23743$ at $|t_0| = 1.4684924$,

determined on the computer. So $N_n^{(d)} = N_n^{(d)}(t_1, \ldots, t_d)$ is asymptotically “most
variable” under $H_0$ on the surface of the $d$-dimensional ball $r^2 = |t_0|^2$. For the
sake of later comparison let us choose all the points $t_1, \ldots, t_d$ on this surface and
record the asymptotic $\alpha$ level significance points obtained from the equation
$(2\Phi(x/0.23743) - 1)^d = 1 - \alpha$, $0 < \alpha < 1$.
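
Both computations in this section can be sketched in a few lines of Python, offered here as an illustration rather than as the method actually used for the paper. The first function locates the maximum of the standard-deviation function by a plain grid search; the second solves the significance-point equation in closed form, $x = 0.23743\,\Phi^{-1}\big((1 + (1 - \alpha)^{1/d})/2\big)$. The function names, the grid range and the grid size are assumptions of the sketch.

```python
import math
from statistics import NormalDist


def sigma(u):
    """Standard-deviation function: 2 exp(-u^2) sqrt(cosh(u^2) - 1 - u^4 / 2)."""
    x = u * u
    # Guard against tiny negative values caused by rounding near u = 0.
    return 2.0 * math.exp(-x) * math.sqrt(max(math.cosh(x) - 1.0 - x * x / 2.0, 0.0))


def argmax_sigma(upper=3.0, steps=30000):
    """Grid search for the maximiser of sigma on [0, upper]."""
    best_u, best_val = 0.0, 0.0
    for k in range(steps + 1):
        u = upper * k / steps
        val = sigma(u)
        if val > best_val:
            best_u, best_val = u, val
    return best_u, best_val


def percentage_point(alpha, d, sigma_max=0.23743):
    """Solve (2 Phi(x / sigma_max) - 1)^d = 1 - alpha for x."""
    p = (1.0 + (1.0 - alpha) ** (1.0 / d)) / 2.0
    return sigma_max * NormalDist().inv_cdf(p)


u0, s0 = argmax_sigma()
x_10pct_d1 = percentage_point(0.10, 1)
```

The grid search returns a maximiser close to the location given in (3.1) and a maximum that agrees with the reported value to about three decimals. The percentage points increase with $d$, as expected for the maximum of more independent components.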


4. The maximal deviation statistic. The natural extension of $N_n^{(d)}$ is

$$M_n^{(d)}(T) = \sup_{t\in[-T,T]^d}|Z_n(t)| = n^{1/2}\sup_{t\in[-T,T]^d}\big|\,|C_n(S_n^{-1/2}t)|^2 - \exp(-|t|^2)\big|,$$

where $T$ is some positive number. Note at the same time that the restriction of
the supremum to a finite cube $[-T, T]^d$ is not a theoretical restriction for the



TABLE 1
Asymptotic $1 - \alpha$ percentage points of $N_n^{(d)}(t_1, \ldots, t_d)$
with $|t_1| = \cdots = |t_d| = |t_0|$


$\alpha$


0.1 0.3906 0.4630 0.5176 0.5276 0.5485 0.5893 
0.05 0.4854 0.5318 0.5675 0.5912 0.6090 0.648} 
0.01 0.6102 0.6672 0.6957 0.7194 0.7360 0.7669 


problem at hand. Indeed, as a consequence of the corresponding well-known 
univariate result and the fact that the univariate normality of all the linear 
combinations of the components of a vector implies the multivariate normality of 
the vector, we have the following: If a d-variate characteristic function coincides 
with a given d-variate normal characteristic function in any small neighbourhood 
of the origin, then they coincide everywhere on $R^d$. Therefore, the only considera-
tions that should be made when choosing T > 0 are those that relate to finite 
sample behaviour and computational ease. 
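
To make the statistic concrete, here is a minimal Python sketch for the univariate case $d = 1$ only, where the Mahalanobis transform reduces to dividing by the sample standard deviation. The supremum is approximated on a grid, which is an assumption of the sketch and not the computing formula of Section 5; the function name, the default $T = 1.47$ and the grid size are likewise illustrative.

```python
import cmath
import math


def max_deviation_1d(sample, T=1.47, grid=1001):
    """Grid approximation of sup |Z_n(t)| over [-T, T] for a univariate sample.

    Z_n(t) = sqrt(n) * (|C_n(t / s_n)|^2 - exp(-t^2)), with s_n^2 the sample
    variance using divisor n. The sample must not be constant (s_n > 0).
    """
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    best = 0.0
    for k in range(grid):
        t = T * k / (grid - 1)      # Z_n(t) = Z_n(-t), so [0, T] suffices
        c = sum(cmath.exp(1j * t * x / sd) for x in sample) / n
        z = math.sqrt(n) * (abs(c) ** 2 - math.exp(-t * t))
        best = max(best, abs(z))
    return best


data = [0.3, -1.2, 2.5, 0.7, -0.4, 1.9, -2.2, 0.1]
m_original = max_deviation_1d(data, grid=201)
m_rescaled = max_deviation_1d([3.0 * x + 5.0 for x in data], grid=201)
```

The two values agree up to rounding, which reflects the invariance of the squared modulus under nonsingular affine transformations of the sample noted in Section 2.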

Under $H_0$, $\lim_{n\to\infty}\Pr\{M_n^{(d)}(T) > y\} = F_{d,T}(y)$ for any $y > 0$, where with $Z$
as in Theorem 2.2,

$$F_{d,T}(y) = \Pr\Big\{\sup_{t\in[-T,T]^d}|Z(t)| > y\Big\}.$$


Of course, this function is not known. We wish to give an upper bound for it. By
the inequality of Fernique [(1975), page 51], for any integer $p \ge 2$, we have

(4.1)  $F_{d,T}(xK_{d,T}(p)) \le 5(\pi/2)^{1/2}\, p^{2d}\,(1 - \Phi(x))$

for any

(4.2)  $x \ge (1 + 4d\log p)^{1/2}$,

where

(4.3)  $K_{d,T}(p) = \sup_{s,t\in[-T,T]^d} \sigma^{1/2}(s,t) + (2+\sqrt{2})\int_1^\infty \varphi_{d,T}(Tp^{-u^2})\,du$,

with

(4.4)  $\varphi_{d,T}(h) = \sup_{s,t\in[-T,T]^d,\ \|s-t\|\le h} \bigl(\sigma(s,s) + \sigma(t,t) - 2\sigma(s,t)\bigr)^{1/2}$,

where $\|s-t\| = \max\{|s_k - t_k| : 1 \le k \le d\}$ is the maximum norm.
By the Cauchy-Schwarz inequality we have

(4.5)  $\sup_{s,t\in[-T,T]^d} \sigma^{1/2}(s,t) = \sup_{0\le|t|\le Td^{1/2}} \sigma(|t|)$,

where $\sigma^2(|t|) = \sigma(t,t)$ is the variance function. As noted, $\sigma(|t|)$ has a global
maximum at $|t_0|$ of (3.1). For the sake of definiteness and in order to include the


714 S. CSORGO 


surface of the ball where $Z(t)$ is most variable, we choose

(4.6)  $T = 1.47/d^{1/2}$

and suppress henceforth this $T$ in the notation. The first term of $K_d(p)$ is then
given by (3.1), and the basic problem is to determine the function $\varphi_d(h)$ in (4.4).
We shall see that the main advantage of the choice in (4.6) is that $K_d(p)$ will not
depend on the dimension if it is higher than one. [See (4.10) and (4.11) below.] Set

$$A_d = \sup_{s,t\in[-1.47/d^{1/2},\,1.47/d^{1/2}]^d} \frac{\sigma(s,s) + \sigma(t,t) - 2\sigma(s,t)}{|s-t|^2}$$

and

(4.7)  $B = \sup_{0\le u\le 1.47} 4e^{-2u^2} g'(u^2)$,

where $g'(x)$ is the derivative of the function

$$g(x) = \cosh(x) - 1 - \frac{x^2}{2}, \qquad x \ge 0,$$

and note that

(4.8)  $A_1 = \sup_{0\le u,v\le 1.47} 4\,\dfrac{e^{-2u^2}g(u^2) + e^{-2v^2}g(v^2) - 2e^{-u^2-v^2}g(uv)}{(u-v)^2}$.

The key step is the following lemma, also proved in the Appendix.

LEMMA 4.1. For any $d \ge 2$, $A_d = \max(A_1, B)$.

Of course, $B$ can be computed on the computer easily, and we obtain

(4.9)  $B = 0.1265243$.
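The value in (4.9) can be checked with a crude search. The sketch below assumes nothing beyond (4.7) and $g'(x) = \sinh x - x$; the grid size and the function names are arbitrary choices made here, not part of the paper.

```python
import math


def h(u):
    """Integrand of (4.7): 4 exp(-2 u^2) g'(u^2) with g'(x) = sinh(x) - x."""
    x = u * u
    return 4.0 * math.exp(-2.0 * x) * (math.sinh(x) - x)


def sup_on_grid(f, lo, hi, n=200000):
    """Largest value of f on an equally spaced grid of n + 1 points."""
    best_u, best = lo, f(lo)
    for i in range(1, n + 1):
        u = lo + (hi - lo) * i / n
        value = f(u)
        if value > best:
            best_u, best = u, value
    return best_u, best


u_star, B = sup_on_grid(h, 0.0, 1.47)
print("maximiser:", u_star, " B:", B, " sqrt(B):", math.sqrt(B))
```

The maximiser lies strictly inside $[0, 1.47]$, and the square root of the maximum is the constant that enters (4.11) for $d \ge 2$.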


A combination of careful numerical analysis and some computer work also gives
that $A_1 \le 0.1085898$. Consequently, for $\varphi_d(h)$ in (4.4), we have

(4.10)  $\varphi_d(h) \le A_d^{1/2} d^{1/2} h$, $\quad 0 \le h < \infty$,

where by Lemma 4.1 and (4.9),

(4.11)  $A_d^{1/2} \begin{cases} \le 0.3295297, & \text{if } d = 1,\\ = 0.3557025, & \text{if } d \ge 2.\end{cases}$

Now putting $T = 1.47d^{-1/2}$ in $K_d(p)$ of (4.3), by (4.5), (3.1), (4.10), (4.11), and
a simple substitution in the integral, we obtain

(4.12)  $K_d(p) \le L_d(p) = 0.23743 + V_d(\log p)^{-1/2}\bigl(1 - \Phi((2\log p)^{1/2})\bigr)$,


where

$$1.47(2+\sqrt{2})(A_d\pi)^{1/2} \le V_d = \begin{cases} 2.9314164, & \text{if } d = 1,\\ 3.16424\ldots, & \text{if } d \ge 2.\end{cases}$$
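The "simple substitution" can be spelled out as follows. This is a reconstruction from (4.3), (4.6) and (4.10), given here only as a worked step; the change of variable $v = u(2\log p)^{1/2}$ is the assumption that makes the normal tail appear.

```latex
\begin{align*}
(2+\sqrt{2})\int_1^\infty \varphi_d\bigl(Tp^{-u^2}\bigr)\,du
  &\le (2+\sqrt{2})\,A_d^{1/2}\,d^{1/2}\,\frac{1.47}{d^{1/2}}
       \int_1^\infty e^{-u^2\log p}\,du \\
  &= 1.47\,(2+\sqrt{2})\,A_d^{1/2}\,(2\log p)^{-1/2}
       \int_{(2\log p)^{1/2}}^\infty e^{-v^2/2}\,dv \\
  &= 1.47\,(2+\sqrt{2})\,(A_d\pi)^{1/2}\,(\log p)^{-1/2}
       \bigl(1-\Phi((2\log p)^{1/2})\bigr).
\end{align*}
```

The factor $d^{1/2}$ from (4.10) cancels against the $d^{-1/2}$ in $T$, which is why the bound depends on the dimension only through $A_d$.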




We may now return to (4.1) and (4.2). Fix an $\alpha$, $0 < \alpha < 1$, and introduce the set

$$E_d(\alpha) = \Bigl\{p : p \text{ integer},\ p \ge 2,\ \Phi^{-1}\Bigl(1 - \frac{\alpha}{5(\pi/2)^{1/2}p^{2d}}\Bigr) \ge (1 + 4d\log p)^{1/2}\Bigr\},$$

where $\Phi^{-1}$ is the inverse function to $\Phi$. Then, with $L_d(p)$ as in (4.12), for
$M_n^{(d)} = M_n^{(d)}(1.47/d^{1/2})$ we have the following result.


THEOREM 4.2. If $H_0$ holds then

$$\lim_{n\to\infty} \Pr\{M_n^{(d)} > z_d(\alpha)\} \le \alpha,$$

where

$$z_d(\alpha) = \inf_{p\in E_d(\alpha)} \Phi^{-1}\Bigl(1 - \frac{\alpha}{5(\pi/2)^{1/2}p^{2d}}\Bigr) L_d(p).$$

The differences $z_d(\alpha) - y_d(\alpha) \ge 0$, where $y_d(\alpha)$ is the real asymptotic $1-\alpha$
percentage point, i.e., $F_d(y_d(\alpha)) = \alpha$, are not known. However, the following
table of the $z_d(\alpha)$ values suggests in comparison with Table 1 that the Fernique
inequality is quite powerful on our process $Z(t)$ and, therefore, the unknown
differences $z_d(\alpha) - y_d(\alpha) \ge 0$ are hopefully not too large.

The computation of this table required a table of $\Phi(x)$ with $x$ in the interval
[2.65, 8.35] and with 17 precise decimals. The $p$ values at which the corresponding
infima were taken ranged from 7 to 12.
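The bound of Theorem 4.2 can be evaluated directly in double precision. The sketch below rests on several assumptions made here: the function names are hypothetical, the search over $p$ is truncated at an arbitrary `p_max`, and $V_d$ is replaced by the left side $1.47(2+\sqrt{2})(A_d\pi)^{1/2}$ of its defining inequality with the constants of (4.11). The upper normal quantile is obtained from the lower tail, which avoids forming $1 - \text{tail}$ for very small tails.

```python
import math
from statistics import NormalDist

N = NormalDist()
SIGMA_MAX = 0.23743        # first term of L_d(p), from (3.1)
ROOT_A_D1 = 0.3295297      # bound on A_1^(1/2), from (4.11)
ROOT_A_D2 = 0.3557025      # A_d^(1/2) for d >= 2, from (4.11)


def v_const(d):
    """Left side of the V_d inequality: 1.47 (2 + sqrt 2) (A_d pi)^(1/2)."""
    root_a = ROOT_A_D1 if d == 1 else ROOT_A_D2
    return 1.47 * (2.0 + math.sqrt(2.0)) * root_a * math.sqrt(math.pi)


def l_bound(d, p):
    """L_d(p) of (4.12); N.cdf(-y) is the upper tail 1 - Phi(y)."""
    lp = math.log(p)
    return SIGMA_MAX + v_const(d) * N.cdf(-math.sqrt(2.0 * lp)) / math.sqrt(lp)


def z_bound(alpha, d, p_max=60):
    """Return (z_d(alpha), minimising p) over p = 2..p_max inside E_d(alpha)."""
    best = None
    for p in range(2, p_max + 1):
        tail = alpha / (5.0 * math.sqrt(math.pi / 2.0) * p ** (2 * d))
        x = -N.inv_cdf(tail)                      # Phi^{-1}(1 - tail)
        if x < math.sqrt(1.0 + 4.0 * d * math.log(p)):
            continue                              # p is not in E_d(alpha)
        z = x * l_bound(d, p)
        if best is None or z < best[0]:
            best = (z, p)
    return best


for d in (1, 2):
    print(d, [z_bound(alpha, d) for alpha in (0.1, 0.05, 0.01)])
```

For $d = 1$ and $d = 2$ the values agree with the first two columns of Table 2 to about three decimals, and the minimising $p$ falls in the range 7 to 12 mentioned above.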

As to the consistency of the test, we can prove the following.


THEOREM 4.3. If the components of $X$ are linearly independent with finite
variances but $H_0$ does not hold, then $M_n^{(d)} \to \infty$ almost surely as $n \to \infty$. Hence
the test is consistent against all such alternatives.


PROOF. Since $S_n^{-1/2} \to \Sigma^{-1/2}$ almost surely, where $\Sigma$ is a symmetric positive
definite $d \times d$ matrix, and $C_n(\cdot)$ converges to the characteristic function $C(\cdot)$ of
$X$ almost surely uniformly on compact sets, $C_n(S_n^{-1/2}t) \to C_0(t) = C(\Sigma^{-1/2}t)$ almost
surely uniformly on $K_d = [-T,T]^d$, where $T$ is as in (4.6). This implies
$|C_n(S_n^{-1/2}t)|^2 \to |C_0(t)|^2$ uniformly on $K_d$.

Suppose that $C_0(t)C_0(-t) = |C_0(t)|^2 = \exp(-|t|^2)$ on $K_d$. Since $\exp(-|t|^2)$ is
a $d$-variate normal characteristic function, and since $|C_0(t)|^2$ is a characteristic

TABLE 2
The upper bounds $z_d(\alpha)$ of the asymptotic $1-\alpha$ percentage points of $M_n^{(d)}$

0.1     0.9648  1.2613  1.4963  1.6985  1.8804  2.0466
0.05    1.0101  1.2998  1.5294  1.7296  1.9024  2.0730
0.01    1.1087  1.3822  1.6034  1.7973  1.9719  2.1257




function, the observation made at the beginning of this section implies that the
equality $C_0(t)C_0(-t) = \exp(-|t|^2)$ must hold on the whole space $R^d$. By the
obvious multivariate version of Cramér's theorem [Lukacs (1970), Theorem 8.2.1;
obtained again from the univariate result by taking linear combinations] the
latter identity implies that the component $C_0(t)$ is a normal characteristic
function. This, in turn, implies that $C(t)$ itself is a normal characteristic function,
which contradicts the assumption that $H_0$ is not satisfied. Therefore,

$$\lim_{n\to\infty} \sup_{t\in K_d} \bigl|\,|C_n(S_n^{-1/2}t)|^2 - e^{-|t|^2}\bigr| = \sup_{t\in K_d} \bigl|\,|C_0(t)|^2 - e^{-|t|^2}\bigr| > 0$$

almost surely, and since the $M_n^{(d)}$ are $n^{1/2}$ times these suprema, $n = 1, 2, \ldots$, the
assertion follows.

If one of the components of $X$ has an infinite variance, then one feels that the
test is "all the more consistent", that is, the natural conjecture is that it is
consistent against every fixed alternative. However, the behaviour of $S_n^{-1/2}$ is
unclear with infinite variances and I do not have a formal proof for this case,
except when $S_n^{-1/2}$ converges to the zero matrix. This is the situation if $d = 1$,
and so the test is indeed consistent against every alternative in the univariate
case. Note that if $d = 1$ and $EX^2 = \infty$, then almost surely as $n \to \infty$,

$$\sup_{-1.47\le t\le 1.47} \bigl|\,|C_n(t/S_n^{1/2})|^2 - e^{-t^2}\bigr| \to \sup_{0\le t\le 1.47} |1 - e^{-t^2}| = 1 - e^{-(1.47)^2} = 0.8848.$$


5. Computing formulae. Writing $Y^j = (Y_1^j, \ldots, Y_d^j) = a_d(L)X(j)S_n^{-1/2}$,
$j = 1, \ldots, n$, with $a_d(L) = 1.47/(d^{1/2}10^L)$ and $c_k(x) = \cos kx$, $s_k(x) = \sin kx$,
where $k$ and $m$ will denote integers and $L > 0$, and using that $Z_n(t) = Z_n(-t)$,
we have

$$M_n^{(1)} \approx n^{1/2} \max_{1\le k\le 10^L} \Bigl|\Bigl(\frac{1}{n}\sum_{j=1}^n c_k(Y^j)\Bigr)^2 + \Bigl(\frac{1}{n}\sum_{j=1}^n s_k(Y^j)\Bigr)^2 - \exp\bigl(-(ka_1(L))^2\bigr)\Bigr|,$$

and using the sine and cosine addition formulae,

$$M_n^{(2)} \approx n^{1/2} \max_{-10^L\le m\le 10^L,\ 1\le k\le 10^L} \Bigl|\Bigl(\frac{1}{n}\sum_{j=1}^n \{c_m(Y_1^j)c_k(Y_2^j) - s_m(Y_1^j)s_k(Y_2^j)\}\Bigr)^2$$
$$\qquad + \Bigl(\frac{1}{n}\sum_{j=1}^n \{s_m(Y_1^j)c_k(Y_2^j) + c_m(Y_1^j)s_k(Y_2^j)\}\Bigr)^2 - \exp\bigl(-(ka_2(L))^2 - (ma_2(L))^2\bigr)\Bigr|,$$

where the larger $L$ is, the more precise are the approximate equalities $\approx$.
Analogously, increasingly more complex formulae can be written down for $M_n^{(d)}$,
$d \ge 3$. The point is that many computers compute sine and cosine slowly, and if
we use the recursion

$$c_2(\cdot) = 2c_1(\cdot)c_1(\cdot) - 1, \qquad s_2(\cdot) = 2c_1(\cdot)s_1(\cdot),$$
$$c_k(\cdot) = 2c_1(\cdot)c_{k-1}(\cdot) - c_{k-2}(\cdot), \qquad s_k(\cdot) = 2c_1(\cdot)s_{k-1}(\cdot) - s_{k-2}(\cdot), \qquad k = 3, \ldots, 10^L,$$

then in $d$ dimensions we need to compute only $nd$ sines and $nd$ cosines. In
practice, $L = 2$ is usually sufficient.
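A minimal univariate sketch of the formula and the recursion follows. Two points are assumptions made here rather than statements of the paper: the sample variance uses the divisor $n$, and the function names are hypothetical. The second function evaluates every cosine and sine directly and serves only as a check on the recursion.

```python
import math


def max_deviation_1d(xs, L=2):
    """Approximate M_n^(1) on the grid t = k * a_1(L), k = 1..10^L, via the recursion."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n      # assumption: divisor n
    K = 10 ** L
    a = 1.47 / K                                   # a_1(L), i.e. d = 1
    ys = [a * x / math.sqrt(s2) for x in xs]
    c1 = [math.cos(y) for y in ys]                 # the only n cosines evaluated
    s1 = [math.sin(y) for y in ys]                 # the only n sines evaluated
    c_prev, s_prev = [1.0] * n, [0.0] * n          # k = 0
    c_cur, s_cur = c1[:], s1[:]                    # k = 1
    best = 0.0
    for k in range(1, K + 1):
        re = sum(c_cur) / n
        im = sum(s_cur) / n
        best = max(best, abs(re * re + im * im - math.exp(-(k * a) ** 2)))
        # three-term recursion: step from k to k + 1
        c_next = [2.0 * c1[j] * c_cur[j] - c_prev[j] for j in range(n)]
        s_next = [2.0 * c1[j] * s_cur[j] - s_prev[j] for j in range(n)]
        c_prev, s_prev, c_cur, s_cur = c_cur, s_cur, c_next, s_next
    return math.sqrt(n) * best


def max_deviation_1d_direct(xs, L=2):
    """Same quantity with every cosine and sine evaluated directly (for checking)."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n
    K = 10 ** L
    a = 1.47 / K
    ys = [a * x / math.sqrt(s2) for x in xs]
    best = 0.0
    for k in range(1, K + 1):
        re = sum(math.cos(k * y) for y in ys) / n
        im = sum(math.sin(k * y) for y in ys) / n
        best = max(best, abs(re * re + im * im - math.exp(-(k * a) ** 2)))
    return math.sqrt(n) * best


sample = [0.3, -1.2, 0.8, 2.1, -0.5, 1.7, -0.9]
print(max_deviation_1d(sample, L=2), max_deviation_1d_direct(sample, L=2))
```

Starting the recursion from $c_0 = 1$, $s_0 = 0$ reproduces the special formulae for $c_2$ and $s_2$, so a single loop covers all $k$. No centring of the data is needed in the statistic itself, since $|C_n|^2$ does not change under a shift of the sample.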


6. Simulation. In the univariate case, we conducted a very limited Monte
Carlo experiment to determine empirically approximate values of the unknown
limiting percentage points $y_1(\alpha)$, $\alpha = 0.1, 0.05, 0.01$. Normal (0,1) random numbers
with sample sizes 50 and 100 were generated 500 times in both cases, and
$M_{50}^{(1)}$ and $M_{100}^{(1)}$ were computed by the above formula with $L = 2$. The obtained
percentage points of the 500 samples for $M_{50}^{(1)}$ and $M_{100}^{(1)}$ are the following:

$\alpha$      $M_{50}^{(1)}$    $M_{100}^{(1)}$
0.1     0.5069   0.5160
0.05    0.6044   0.6173
0.01    0.9678   0.8455

These should be compared with the first columns of Tables 1 and 2.

Of course large order statistics of $M_n^{(1)}$ in 500 samples are more unstable than
smaller ones.

In general ($d \ge 1$), the very sharp inequality of Borell (1975) says instead of
(4.1) that

$$F_d(x\sigma + m_d) \le 1 - \Phi(x), \qquad x > 0,$$

where $\sigma$ is the supremum of the pointwise standard deviation of $Z(t)$ on
$[-T,T]^d$, and $m_d$ is the median of the distribution of $M^{(d)} = \sup\{|Z(t)| :
t \in [-T,T]^d\}$. The problem is, of course, that we do not know $m_d$. With the
choice of $T$ as in (4.6) (or larger), $\sigma = 0.23743$ according to (3.1). (If we use
Fernique's inequality first to give an upper bound for $m_d$ and then Borell's
inequality with this upper bound, then we obtain larger values than those given
in Table 2.)

The median of the 500 samples for $M_{50}^{(1)}$ was 0.2258, and that for $M_{100}^{(1)}$ was
0.2273. Arguing that simulation of middle percentage points is more stable, let us
accept that 0.23 is a close upper bound for $m_1$. Then Borell's inequality gives the
following tentative close upper bounds for $y_1(\alpha)$:





A thorough simulation study of the properties of the $M_n^{(d)}$ test would be
desirable.
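How a bound of this kind follows from Borell's inequality can be sketched as below. Setting $1 - \Phi(x) = \alpha$ in the inequality gives $y_1(\alpha) \le \sigma\,\Phi^{-1}(1-\alpha) + m_1$ for $\alpha < 1/2$. The function name is hypothetical, and the numbers produced are an evaluation of the displayed inequality with $\sigma = 0.23743$ and the accepted median bound 0.23, not a transcription of any printed table.

```python
from statistics import NormalDist

SIGMA = 0.23743       # supremum of the pointwise standard deviation, from (3.1)
MEDIAN_BOUND = 0.23   # accepted upper bound for the median m_1


def borell_bound(alpha, sigma=SIGMA, median=MEDIAN_BOUND):
    """Upper bound x * sigma + median for the 1 - alpha point, where 1 - Phi(x) = alpha."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("the inequality is stated for x > 0, i.e. alpha < 1/2")
    x = NormalDist().inv_cdf(1.0 - alpha)
    return x * sigma + median


for alpha in (0.1, 0.05, 0.01):
    print(alpha, round(borell_bound(alpha), 4))
```

The resulting bounds are smaller than the first column of Table 2, which is in line with the description of Borell's inequality as very sharp.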




7. Examples. Our first example is testing the bivariate normality of Norton's 
rate of discount and ratio of reserves to deposits data of size 780 as given on page 
205 of Yule and Kendall (1950). The histogram of Yule and Kendall shows clearly 
that the distribution is not normal. Indeed, Mardia (1970) rejects normality by 
his two tests and Rincon-Gallardo et al. (1979) also reject normality by both the 
tests they use after their transformation. In our case $M_{780}^{(2)} = 19.9785$ (with
$L = 2$), which in comparison with the second column of Table 2 shows an
extremely significant departure from bivariate normality.

The second example is the well-known Iris setosa data originally analysed by 
Fisher (1936). The data consists of 50 observations on each of four variables 
(sepal length, sepal width, petal length, petal width), and it is commonly believed 
that, in some form or other, it is from a quadrivariate normal distribution. 
Rincon-Gallardo et al. (1979) accept this hypothesis for the original data by both 
of their tests. In our case $M_{50}^{(4)} = 5.6967$ (we computed the latter value with only
$L = 1$ in the approximate four-variate formula) which in comparison with the
fourth column of Table 2 shows a highly significant departure from quadrivariate
normality. Contradicting Rincon-Gallardo et al. (1979), Koziol (1982, 1983) accepts
the four-variate normality of the logarithms of the original Iris setosa data as
given in Gnanadesikan [(1977), page 219]. For these logarithms, we obtained
$M_{50}^{(4)} = 5.8845$ (again with $L = 1$), so that the significance of departure from
normality is even higher than for the original data. Hence we reject the hypothesis
of quadrivariate normality of both the original and the logarithmic data.

The limited experience in the present and preceding sections suggests that the
test based on $M_n^{(d)}$ may be highly sensitive. The only arbitrary element in our
test $M_n^{(d)}(T)$ is the choice $T = 1.47/\sqrt{d}$. It is conceivable that the larger $T$ is,
the more powerful the test will be. Of course, the $A_d$ and hence the $V_d$ constants
belonging to another choice $T = T_0/\sqrt{d} > 1.47/\sqrt{d}$ can be easily recomputed,
and then the subsequent bounds $z_d(\alpha)$ can also be obtained. However, the larger
$T_0$ is, the slower is the convergence of $M_n^{(d)}(T_0/\sqrt{d})$ and the more computer time
is needed for the computation of the statistic. We believe that our choice
$T_0 = 1.47$, with the given motivation, is a reasonable compromise.


8. The Cramér–von Mises type statistic. Another plausible statistic based
on the process $Z_n(\cdot)$ would be to consider

$$\int_{[-T,T]^d} Z_n^2(t)\,dt,$$

with some $T > 0$, which by Theorem 2.2 converges in distribution to $\sum_{k=1}^\infty \lambda_k W_k^2$,
where $W_1, W_2, \ldots$ are independent standard normal random variables and $\lambda_k =
\lambda_k(d, T)$ are the solutions of the eigenvalue-eigenfunction equation

$$\int_{[-T,T]^d} \sigma(s,t)\varphi(s)\,ds = \lambda\varphi(t),$$

with the covariance function $\sigma(\cdot,\cdot)$ given in Theorem 2.2. It would be desirable
to compute numerically a sufficient set of the largest eigenvalues to approximate
adequately the limiting distribution. A referee suggested a discrete approximation
to the above equation using the covariance matrix $\sigma(s_i, t_j)$ on some
appropriate grid $\{(s_i, t_j)\}$. [See, for example, Schilling (1983b).]
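One way such a discrete approximation might look in the univariate case is sketched below. Everything specific here is an assumption made for illustration: the midpoint grid, the grid size, the use of power iteration, the function names, and the covariance written as $\sigma(s,t) = 4e^{-s^2-t^2}g(st)$, which is the form appearing in (4.8). Only the largest eigenvalue is approximated.

```python
import math


def cov(s, t):
    """sigma(s, t) for d = 1, in the form used in (4.8)."""
    x = s * t
    return 4.0 * math.exp(-s * s - t * t) * (math.cosh(x) - 1.0 - 0.5 * x * x)


def largest_eigenvalue(T=1.47, m=80, iters=100):
    """Power iteration on the midpoint-rule discretisation of the integral operator."""
    h = 2.0 * T / m
    grid = [-T + (i + 0.5) * h for i in range(m)]
    K = [[cov(s, t) * h for t in grid] for s in grid]   # kernel times quadrature weight
    v = [1.0 / math.sqrt(m)] * m                        # unit starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(m)) for i in range(m)]
        lam = math.sqrt(sum(x * x for x in w))          # norm of K v for unit v
        v = [x / lam for x in w]
    trace = sum(K[i][i] for i in range(m))              # approximates the sum of all eigenvalues
    return lam, trace


lam, trace = largest_eigenvalue()
print("largest eigenvalue:", lam, " trace:", trace)
```

Further eigenvalues would need deflation or a full symmetric eigensolver; the trace gives a quick indication of how much of the total is carried by the leading term.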


APPENDIX 


PROOF OF THEOREM 2.1. We have

$$Z_n(t) = n^{1/2}\{U_n(S_n^{-1/2}t) - U(t)\}\{U_n(S_n^{-1/2}t) + U(t)\} + n^{1/2}\{V_n(S_n^{-1/2}t) - V(t)\}\{V_n(S_n^{-1/2}t) + V(t)\}.$$

Hence, on account of the fact that $S_n^{-1/2} \to I$ almost surely and the triviality
(Csörgő, 1981) that $C_n$ converges almost surely uniformly to $C$ on any bounded
set in $R^d$, the theorem will follow once we have shown that the complex valued
processes

$$Y_n(t) = n^{1/2}\{C_n(S_n^{-1/2}t) - C(t)\}$$

converge weakly in $\mathscr{C}^2$ to the complex Gaussian process $Y(t)$. We proceed to
prove this.

Applying the one-term $d$-variate Taylor formula to the second exponential
function in the third line at (2.1), we obtain

$$Y_n(t) = n^{-1/2}\sum_{j=1}^n \bigl\{\exp(i\langle t, X(j)\rangle) - C(t) + \langle (S_n^{-1/2} - I)t, \nabla C_n(t)\rangle\bigr\} + \Delta_n(t),$$

where

$$\sup_{t\in[-T,T]^d} |\Delta_n(t)| \le \sup_{t\in[-T,T]^d} \frac{1}{2}\,n^{1/2}\,\frac{1}{n}\sum_{j=1}^n \langle (S_n^{-1/2} - I)t, X(j)\rangle^2$$
$$\le (2n)^{-1}\sum_{j=1}^n |X(j)|^2 \sup_{t\in[-T,T]^d} n^{1/2}|(S_n^{-1/2} - I)t|^2 = O(n^{-1/2}\log\log n)$$

almost surely by the law of large numbers and the law of the iterated
logarithm, the latter being applied to the elements of $(S_n - I)$ after the
rearrangement

(A.1)  $S_n^{-1/2} - I = -S_n^{-1/2}(I + S_n^{1/2})^{-1}(S_n - I)$.

A result in Csörgő (1981), conveniently formulated for the present purpose in
Theorem 2.1 of Ledoux (1982), implies that

$$\sup_{t\in[-T,T]^d} |\nabla C_n(t) - \nabla C(t)| = O(n^{-1/2}(\log\log n)^{1/2})$$




almost surely. Whence, using (A.1), we see that

$$\sup_{t\in[-T,T]^d} \bigl|\langle n^{1/2}(S_n^{-1/2} - I)t, \nabla C_n(t)\rangle + \tfrac{1}{2}\langle n^{1/2}(S_n - I)t, \nabla C(t)\rangle\bigr| = O(n^{-1/2}\log\log n)$$

almost surely. Let $1 \le m \le d$. A simple computation justifies that the $m$th
component of the vector $n^{1/2}(S_n - I)t$ is

$$n^{-1/2}\sum_{j=1}^n \Bigl\{(X_m^2(j) - 1)t_m + \sum_{k=1,\,k\ne m}^d X_m(j)X_k(j)t_k\Bigr\} - n^{1/2}\bar{X}_m(n)\langle \bar{X}(n), t\rangle.$$

The supremum of the second term here, over $[-T,T]^d$, is again
$O(n^{-1/2}\log\log n)$ almost surely by the classical log log law. Hence

(A.2)  $Y_n(t) = n^{-1/2}\sum_{j=1}^n R_j(t) + B_n(t)$,

where $R_j(t) = R(t; X(j))$ with $R(t; X)$ as in (2.3), and

$$\sup_{t\in[-T,T]^d} |B_n(t)| = O(n^{-1/2}\log\log n)$$

almost surely. Since the random functions $R_j(t)$ in the representation (A.2) are
independent and identically distributed with mean zero and covariance $\rho(s,t)$,
the multidimensional central limit theorem implies that the finite-dimensional
distributions of $Y_n(\cdot)$ converge to those of $Y(\cdot)$. The tightness of $\{Y_n(\cdot)\}$ follows
from (2.4), and hence the theorem.


PROOF OF LEMMA 4.1. First we fix the lengths $u = |t|$, $v = |s|$ of the vectors
and let the inner product $x = \langle s, t\rangle$ vary in its range $0 \le x \le uv \le 2^{-1}(u^2 + v^2)$.
We have

$$A_d = \sup_{0\le u,v\le 1.47}\ \sup_{0\le x\le uv} f_{u,v}(x),$$

where

$$f_{u,v}(x) = 4\,\frac{e^{-2u^2}g(u^2) + e^{-2v^2}g(v^2) - 2e^{-u^2-v^2}g(x)}{u^2 + v^2 - 2x}$$

with derivative

$$f'_{u,v}(x) = 4\,\frac{2\bigl(e^{-2u^2}g(u^2) + e^{-2v^2}g(v^2) - 2e^{-u^2-v^2}g(x)\bigr) - 2e^{-u^2-v^2}g'(x)(u^2 + v^2 - 2x)}{(u^2 + v^2 - 2x)^2}.$$

Clearly,

$$A_1 = \sup_{0\le u,v\le 1.47} f_{u,v}(uv) \le A_d,$$

and

$$A_d \ge \sup_{0\le u\le 1.47}\ \sup_{0\le x\le u^2} f_{u,u}(x) = \sup_{0\le u\le 1.47} 4e^{-2u^2} \sup_{0\le x\le u^2} \frac{g(u^2) - g(x)}{u^2 - x} = B$$

since $g$ is a convex function. Hence $A_d \ge \max(A_1, B)$.
To prove the reverse inequality, consider the alternative:

$$e^{-2u^2}g(u^2) + e^{-2v^2}g(v^2) \le (\text{or} >)\ 2e^{-u^2-v^2}g(\tfrac{1}{2}(u^2 + v^2)).$$

If "$\le$", then

$$f_{u,v}(x) \le 4e^{-u^2-v^2}\,\frac{g(\tfrac{1}{2}(u^2 + v^2)) - g(x)}{\tfrac{1}{2}(u^2 + v^2) - x} \le 4e^{-u^2-v^2}g'(\tfrac{1}{2}(u^2 + v^2))$$

by the convexity of $g$. In the opposite case the numerator of $f'_{u,v}(x)$ is larger than

$$16e^{-u^2-v^2}\bigl\{g(\tfrac{1}{2}(u^2 + v^2)) - g(x) - g'(x)\{\tfrac{1}{2}(u^2 + v^2) - x\}\bigr\}$$

and this lower bound is nonnegative again by the convexity of $g$. So if "$>$", then
$f_{u,v}(x) \le f_{u,v}(uv)$. Hence $A_d \le \max(A_1, B)$ and the lemma is proved.


Acknowledgments. I am grateful to David Mason and Vilmos Totik for 
their comments and to Pal Tőke for programming the computations in Sections 6 
and 7. The comments of two referees are also appreciated. 


REFERENCES 


AITKIN, M. A. (1972). A class of tests for multivariate normality based on linear functions of order
statistics. Unpublished manuscript.

ANDERSON, T. W. (1966). Some nonparametric multivariate procedures based on statistically
equivalent blocks. In Multivariate Analysis (P. R. Krishnaiah, ed.) 5-27. Academic, New York.

ANDREWS, D. F., GNANADESIKAN, R. and WARNER, J. L. (1971). Transformations of multivariate
data. Biometrics 27 825-840.

ANDREWS, D. F., GNANADESIKAN, R. and WARNER, J. L. (1972). Methods for assessing multivariate
normality. Bell Laboratories Memorandum.

ANDREWS, D. F., GNANADESIKAN, R. and WARNER, J. L. (1973). Methods for assessing multivariate
normality. In Multivariate Analysis 3 (P. R. Krishnaiah, ed.) 95-116. Academic, New
York.

BICKEL, P. J. and BREIMAN, L. (1983). Sums of functions of nearest neighbor distances, moment
bounds, limit theorems and a goodness of fit test. Ann. Probab. 11 185-214.

BORELL, C. (1975). The Brunn-Minkowski inequality in Gauss space. Invent. Math. 30 207-216.

COX, D. R. (1968). Notes on some aspects of regression analysis. J. Roy. Statist. Soc. Ser. A 131
265-279.

COX, D. R. and SMALL, N. J. H. (1978). Testing multivariate normality. Biometrika 65 263-272.

CSÖRGŐ, S. (1981). Multivariate empirical characteristic functions. Z. Wahrsch. verw. Gebiete 55
203-229.




DAHIYA, R. C. and GURLAND, J. (1973). A test of fit for bivariate distributions. J. Roy. Statist. Soc.
Ser. B 35 452-465.

DAY, N. R. (1969). Divisive cluster analysis and a test for multivariate normality. Bull. Inst.
Internat. Statist. 43 110-112.

DEWET, T., VENTER, J. H. and VAN WYK, J. W. J. (1979). The null distributions of some test criteria
of multivariate normality. South African Statist. J. 13 153-176.

FERNIQUE, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. In École d'Été
de Probabilités de Saint-Flour, IV-1974. Lecture Notes in Math. 480 1-96. Springer,
Berlin.

FISHER, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics
7 179-188. (Also paper 32 in Contributions to Mathematical Statistics, R. A. Fisher.
Wiley, New York.)

GIORGI, G. M. and FATTORINI, L. (1976). An empirical study of some tests for multivariate normality.
Quadern dell Instituto di Statistica No. 20 (8 pages). Siena.

GNANADESIKAN, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations.
Wiley, New York.

GNANADESIKAN, R. and KETTENRING, J. R. (1972). Robust estimates, residuals, and outlier detection
with multiresponse data. Biometrics 28 81-124.

HALL, P. and WELSH, A. H. (1983). A test for normality based on the empirical characteristic
function. Biometrika 70 485-489.

HAWKINS, D. M. (1981). A new test for multivariate normality and homoscedasticity. Technometrics
23 105-110.

HEALY, M. J. R. (1968). Multivariate normal plotting. Appl. Statist. 17 157-161.

HENSLER, G. I., MEHROTA, K. G. and MICHALEK, J. E. (1977). A goodness of fit test for multivariate
normality. Comm. Statist. A-Theory Methods 6 33-41.

KESSEL, D. I. and FUKUNAGA, K. (1972). A test for multivariate normality with unspecified
parameters. Unpublished report, Purdue Univ.

KOZIOL, J. A. (1982). A class of invariant procedures for assessing multivariate normality. Biometrika
69 423-427.

KOZIOL, J. A. (1983). On assessing multivariate normality. J. Roy. Statist. Soc. Ser. B 45 358-361.

LEDOUX, M. (1982). Loi du logarithme itéré dans C(S) et fonction caractéristique empirique. Z.
Wahrsch. verw. Gebiete 60 425-435.

LUKACS, E. (1970). Characteristic Functions. Griffin, London.

MALKOVICH, J. F. (1971). Tests for multivariate normality. Unpublished Ph.D. thesis, Univ. of
California, Los Angeles.

MALKOVICH, J. F. and AFIFI, A. A. (1973). On tests for multivariate normality. J. Amer. Statist.
Assoc. 68 176-179.

MARDIA, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika
57 519-530.

MARDIA, K. V. (1974). Applications of some measures of multivariate skewness and kurtosis for
testing normality and robustness studies. Sankhya Ser. A 36 115-128.

MARDIA, K. V. (1975). Assessment of multinormality and the robustness of Hotelling's T² test. J.
Roy. Statist. Soc. Ser. C 24 163-171.

MARDIA, K. V. (1980). Tests of univariate and multivariate normality. In Handbook of Statistics 1.
Analysis of Variance (P. R. Krishnaiah, ed.) 279-320. North-Holland, Amsterdam.

MARDIA, K. V. and ZEMROCH, P. J. (1975). Algorithm AS84. Measures of multivariate skewness and
kurtosis. J. Roy. Statist. Soc. Ser. C 24 262-265.

MOORE, D. S. and STUBBLEBINE, J. B. (1981). Chi-square tests for multivariate normality with
applications to common stock prices. Comm. Statist. A-Theory Methods 10 713-738.

MUROTA, K. (1981). Test for normality based on the empirical characteristic function. Rep. Statist.
Appl. Res. Un. Japan. Sci. Engrs. 28 1-17.

MUROTA, K. and TAKEUCHI, K. (1981). The studentized empirical characteristic function and its
application to test for the shape of distribution. Biometrika 68 55-65.

PETTITT, A. N. (1979). Testing for bivariate normality using the empirical distribution function.
Comm. Statist. A-Theory Methods 8 699-712.




RINCON-GALLARDO, S., QUESENBERRY, C. P. and O'REILLY, F. J. (1979). Conditional probability
integral transformations and goodness-of-fit tests for multivariate normal distributions.
Ann. Statist. 7 1052-1057.

SARKADI, K. and TUSNÁDY, G. (1977). Testing for normality and for the exponential distribution. In
Proc. Fifth Conference on Probability Theory 1974, Brasov (B. Bereanu, M. Iosifescu,
and G. Popescu, eds.) 99-118. Editura Academiei, Bucharest.

SCHILLING, M. F. (1983a). Goodness of fit testing in R^m based on the weighted empirical distribution
of certain nearest neighbor statistics. Ann. Statist. 11 1-12.

SCHILLING, M. F. (1983b). An infinite-dimensional approximation for nearest neighbor goodness of fit
tests. Ann. Statist. 11 13-24.

SMALL, N. J. H. (1978). Plotting squared radii. Biometrika 65 657-658.

SMALL, N. J. H. (1980). Marginal skewness and kurtosis in testing multivariate normality. Appl.
Statist. 29 85-87.

WAGLE, B. (1968). Multivariate beta distribution and a test for multivariate normality. J. Roy.
Statist. Soc. Ser. B 30 511-536.

WELSH, A. H. (1985). A note on scale estimates based on the empirical characteristic function and
their application to test for normality. Statist. Probab. Letters 3.

WEISS, L. (1958). A test of fit for multivariate distributions. Ann. Math. Statist. 29 595-599.

WILK, M. B. and GNANADESIKAN, R. (1968). Probability plotting methods for the analysis of data.
Biometrika 55 1-17.

YANG, S.-S. (1981). Linear combinations of concomitants of order statistics with applications to
testing and estimation. Ann. Inst. Statist. Math. 33 463-470.

YULE, G. U. and KENDALL, M. G. (1950). An Introduction to the Theory of Statistics, 14th ed.
Hafner, New York.


BOLYAI INSTITUTE 
SZEGED UNIVERSITY 
ARADI VERTANUK TERE 1 
H-6720 SZEGED 
HUNGARY 


The Annals of Statistics 
1986, Vol 14, No. 2, 724-732 


MINIMAX VARIANCE M-ESTIMATORS OF LOCATION
IN KOLMOGOROV NEIGHBOURHOODS$^1$


By DOUG WIENS
Dalhousie University

We exhibit those distributions with minimum Fisher information for
location in various Kolmogorov neighbourhoods $\{F \mid \sup_x |F(x) - G(x)| \le \varepsilon\}$
of a fixed, symmetric distribution $G$. The associated M-estimators are then
most robust (in Huber's minimax sense) for location estimation within these
neighbourhoods. The previously obtained solution of Huber (1964) for $G = \Phi$
and "small" $\varepsilon$ is shown to apply to all distributions with strongly unimodal
densities whose score functions satisfy a further condition. The "large" $\varepsilon$
solution for $G = \Phi$ of Sacks and Ylvisaker (1972) is shown to apply under
much weaker conditions. New forms of the solution are given for such
distributions as "Student's" $t$, with nonmonotonic score functions. The general
form of the solution is discussed.


1. Introduction and summary. Consider Huber's (1964) theory of robust
M-estimation of a location parameter $\theta$. Let $\hat\theta$ be defined as a zero of $\sum\psi(x_i - \cdot)$,
for a suitably chosen $\psi$, where $X_i \sim F(x - \theta)$ and $F$ is an unknown member of a
convex, vaguely compact class $\mathscr{F}$ of distributions. Typically, $\sqrt{n}(\hat\theta - \theta)$ is
asymptotically normally distributed. Let $V(\psi, F)$ denote the asymptotic variance
functional. The choice $\psi_0$ is then most robust, in the minimax sense, if it
minimizes $\sup_{\mathscr{F}} V(\psi, F)$.

In Huber (1964) and in particular in Chapter 4 of Huber (1981), general
procedures are derived for finding most robust M-estimators. We briefly summarize
what are, for us, the salient features. One first demands optimality only
over that subclass $\mathscr{F}'$ of $\mathscr{F}$ whose members have finite Fisher information for
location $I(F)$. Any $F \in \mathscr{F}'$ necessarily has an absolutely continuous, bounded
density $f$, tending to 0 as $x \to \pm\infty$, and then $I(F) = \int (f'/f)^2 f\,dx$. There exists
$F_0 \in \mathscr{F}'$ minimizing $I(F)$. If $I(F_0) > 0$, and $f_0$ has convex support, then $F_0$ is
unique. Furthermore, $\psi_0 = -f_0'/f_0$ is most robust over $\mathscr{F}'$. If $\mathscr{F}'$ is vaguely
dense in $\mathscr{F}$, and if $\psi_0$ is sufficiently regular (see Theorem 5 of Huber
(1964)), then $\psi_0$ is optimal over the larger class. Necessary and sufficient for $F_0$
to minimize $I(F)$ is the condition

(1.1)  $-\int \{2\psi_0(f' - f_0') + (f - f_0)\psi_0^2\}\,dx \ge 0$, all $F \in \mathscr{F}'$.

In this paper we apply the above theory to cases in which $\mathscr{F}$, written $K_\varepsilon$, is a
Kolmogorov neighbourhood of a fixed distribution $G$: $K_\varepsilon = \{F \mid \sup_x |F(x) -
G(x)| \le \varepsilon\}$. In the case $G = \Phi$, the normal cumulative, Huber (1964) obtained the
most robust $\psi_0$ for $\varepsilon \le 0.0303$, Sacks and Ylvisaker (1972) for $\varepsilon > 0.0303$.


Received August 1984; revised July 1985. 

$^1$ Research supported by Natural Sciences and Engineering Research Council of Canada grant
A-8603.

AMS 1980 subject classifications. Primary 62G35; secondary 62G05. 

Key words and phrases. Robust estimation, M-estimators, minimax variance, Kolmogorov 
neighbourhood, minimum Fisher information. 






Somewhat surprisingly, the general case of this problem seems not to have been
addressed.

We will assume throughout that $G$ is fully stochastic, symmetric, strictly
increasing on $(-\infty, \infty)$, and has an absolutely continuous density $g$ with respect
to Lebesgue measure. The score function $\xi = -g'/g$ is assumed to be differentiable
except possibly at zero. The assumption of symmetry implies that $F_0$ is
symmetric [Huber (1981), page 89]. Although it is not assumed that $I(G) < \infty$,
the continuity of $G$ ensures that $K_\varepsilon'$ is dense in $K_\varepsilon$ [Vandelinde (1979), page 186].

Huber (1964) showed that the most robust $\psi_0$ has essentially the same form
for all $\varepsilon$-contamination classes $\{F = (1 - \varepsilon)G + \varepsilon H;\ G'$ symmetric, strongly
unimodal$\}$. In contrast, we will show that the aforementioned "small $\varepsilon$" solution
for $K_\varepsilon$ does not extend in this way, but that it does apply in the presence of the
requirement (strictly stronger than strong unimodality) that $J(\xi) = 2\xi' - \xi^2$
be decreasing on $[0, \infty)$. Note that, under the requisite regularity, (1.1) becomes
$\int J(\psi_0)\,d(F - F_0) \ge 0$ by partial integration. Similarly, $I(G) = \int J(\xi)\,dG$.

It is our thesis that the form of $\psi_0$ may be inferred quite generally from the
behaviour of $J(\xi)$. This approach was adopted by Collins and Wiens (1985) in
determining general properties of least informative distributions in arbitrary
$\varepsilon$-contamination classes. In Section 3 it is applied to such distributions as $G_l(x)$,
with density proportional to $\exp(-|x|^l/l)$, and to "Student's" $t$-distribution. For
distributions such as the $t$, $\xi$ and $J(\xi)$ are nonmonotone, resulting in this case in
six distinct forms of the solution, depending upon $\varepsilon$ and the degrees of freedom.
Five of these are rather unwieldy; the sixth coincides with the Sacks-Ylvisaker
"large $\varepsilon$" form. This form is shown to apply for all sufficiently large Kolmogorov
neighbourhoods, under very mild conditions on $\xi$.


2. Necessary and sufficient properties of $F_0$. In this section we exhibit
some conditions which are necessary and sufficient in order that $F_0$ have
minimum information in $K_\varepsilon$. These lead to some heuristic considerations of the
general form of $\psi_0$ which motivate the solutions given, in Section 3, for some
special classes of distributions $G$.

Partition the support of $f_0$, in $(0, \infty)$, into disjoint sets

$$B_0 = \{x \mid \max(\tfrac{1}{2}, G(x) - \varepsilon) < F_0(x) < \min(1, G(x) + \varepsilon)\},$$
$$B_L = \{x \mid F_0(x) = G(x) - \varepsilon\},$$
$$B_U = \{x \mid F_0(x) = G(x) + \varepsilon\}.$$

Define a functional $J$ on the set of continuous (except possibly at zero), piecewise
continuously differentiable functions $\psi$ by $J(\psi) = 2\psi' - \psi^2$. Extend $J(\psi)$ by left
continuity where $\psi'$ is discontinuous. If $\psi$ is discontinuous at zero, set $J(\psi)(0) =
\operatorname{sign}(\psi(0^+) - \psi(0^-)) \cdot \infty$, corresponding to use of the Schwarz derivative
[Natanson (1960)].

It turns out that $J(\psi_0) = $ constant on each component of $B_0$. We note that
the only solution to $J(\psi) = \lambda^2$ is of the form $\lambda\tan(\lambda(x - \omega)/2)$ for some
parameter $\omega$, and that those to $J(\psi) = -\lambda^2$ are $\lambda\tanh(-\lambda(x - \omega)/2)$, $\lambda$, and
$\lambda\coth(-\lambda(x - \omega)/2)$.
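These closed forms are easy to confirm numerically. The sketch below differentiates each candidate by central differences and evaluates $J(\psi) = 2\psi' - \psi^2$; the values of $\lambda$ and $\omega$, the step size, and the function names are arbitrary choices made here for illustration.

```python
import math

LAM, OMEGA = 0.8, 0.3   # arbitrary illustrative values of lambda and omega


def J(psi, x, h=1e-5):
    """J(psi)(x) = 2 psi'(x) - psi(x)^2, with psi' from a central difference."""
    dpsi = (psi(x + h) - psi(x - h)) / (2.0 * h)
    return 2.0 * dpsi - psi(x) ** 2


def tan_sol(x):
    return LAM * math.tan(LAM * (x - OMEGA) / 2.0)


def tanh_sol(x):
    return LAM * math.tanh(-LAM * (x - OMEGA) / 2.0)


def coth_sol(x):
    return LAM / math.tanh(-LAM * (x - OMEGA) / 2.0)


def const_sol(x):
    return LAM


for x in (1.0, 1.5, 2.0):
    print(x, J(tan_sol, x), J(tanh_sol, x), J(coth_sol, x), J(const_sol, x))
```

The first column of values stays at $\lambda^2$ and the remaining three at $-\lambda^2$, up to the discretisation error of the difference quotient.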




THEOREM 1. If $F_0$ possesses the following properties, then it is the unique
member of $K_\varepsilon'$ minimizing $I(F)$ over $K_\varepsilon$.


1. $F_0 \in K_\varepsilon$, $F_0$ symmetric, $F_0(\infty) = 1$.
2. $F_0$ has an absolutely continuous density $f_0$ with respect to Lebesgue measure,
and $\psi_0 = -f_0'/f_0$ is absolutely continuous on $(-\infty, \infty)$.
3. There exists a, possibly infinite, set of intervals [b,,a,,,], with 0 < b,, 
a, < b, <a,,,, limsup,_,,.a,*= a < 0; and constants d,, X such that 
(i) B, U By = U6, @,41]; 


Ais xe LO, bil, 
s. À x & ai; b, ? 
(1) Apax) = ( 
—)?, x> a, 


J(E)(x), = x € Int(B, U By); 


(Gii) f x € B,, then (g(x) = Avo Xx + 9), 
if x E By, then pox) < S(vy)(x + 0). 


PROOF. It follows from "$J(\psi_0) = -\lambda^2$" on $(a, \infty)$ that $\psi_0 \equiv \lambda > 0$ and
$f_0(x) = f_0(a)\exp(\lambda(a - x))$ there. Here we use the fact that the tanh and coth
solutions are both eventually negative ($f_0$ increasing). In particular, $\psi_0$ is
bounded and $f_0(x) \to 0$ as $x \to \pm\infty$. On $\operatorname{Int}(B_L \cup B_U)$, $f_0(x) = g(x) > 0$. On
$B_0$, $f_0 > 0$ since no $f_0$ corresponding to a solution to $J(\psi_0) = \lambda$ can descend to
zero with $\psi_0$ remaining bounded. Thus $f_0 > 0$ on $(-\infty, \infty)$, so that we need only
verify (1.1), and that $0 < I(F_0) < \infty$. By partial integration, (1.1) becomes

(2.1)  $\int J(\psi_0)\,d(F - F_0) \ge 0$.

It suffices to check this for symmetric $F \in K_\varepsilon'$. Assume that "$a$" is the only
possible accumulation point of $\{a_i\}$; the general case is similar. If we put
$H = F - F_0$ and integrate $\int J(\xi)\,dH$ by parts on those nondegenerate intervals in
$B_L \cup B_U$, (2.1) becomes


0< lim S JI(Yo) dH -X f” aH 


= lim f dL (IC po)(b,) ~ IC Yo)(b, + 0))H(b,) 


— 
rie D <a, Sanri 


nil 


+ 2 (Iola) — J( Yoa, sı + 0))H(a,,,) 
(2.2) b<a,,;S4, 


Oy d 
-MH(x)- E f HG) ENE) d) 


bCa, Sap % 


+ {Ipo any Hlan) — J(Yo)(a + O(a) | 


By 1 and 3(iii), all terms within the first set of braces are nonnegative, as is the
limit of the remaining term. Thus (1.1) is satisfied. That $0 < I(F_0) = \int \psi_0^2\,dF_0 < \infty$
is obvious. □


It is also necessary that $F_0$ satisfy the conditions of Theorem 1. Since the
necessity is not explicitly required, the proof (available from the author) is
omitted. We note however that $F_0 \in K_\varepsilon$ forces, in turn, the additional necessary
conditions

4(i)  $f_0(x) = g(x)$, $x \in B_L \cup B_U$;
4(ii)  $\psi_0(x) - \xi(x) \le 0$ on $B_L$ ($\ge 0$ on $B_U$).

In Section 3 we exhibit the minimum information distributions $F_0$ for some
particular Kolmogorov neighbourhoods. The general principle at work appears to
be that for sufficiently small $\varepsilon$, $\psi_0$ should differ from $\xi$ only near the local
extrema of $J(\xi)$; and that here we should have $J(\psi_0) = $ const, with this constant
being less extreme than that attained by $J(\xi)$. In line with (2.1), we should have
$f_0 \ge g$, $F_0 - G$ increasing from $-\varepsilon$ to $\varepsilon$, near the local minima of $J(\xi)$, and $f_0 \le g$,
$F_0 - G$ decreasing from $\varepsilon$ to $-\varepsilon$, near the local maxima. This is illustrated by
Theorem 2 below. As $\varepsilon$ increases, the regions of constancy of $J(\psi_0)$ coalesce. It
is shown in Theorem 3 that for sufficiently large $\varepsilon$, the solution quite generally
has $J(\psi_0) = \lambda_1^2(\varepsilon)$ on $[-b(\varepsilon), b(\varepsilon)]$, $J(\psi_0) = -\lambda^2(\varepsilon)$ elsewhere, with $F_0(b) =
G(b) - \varepsilon$ and $b, \lambda_1, \lambda \to 0$ as $\varepsilon \uparrow \tfrac{1}{2}$. We conjecture that this "large $\varepsilon$" form of the
solution is universally valid. We also give examples (Examples 2 and 3 below) of
classes of distributions for which there are intermediary forms of the solution.


3. Some classes of solutions. The preceding discussion suggests that if
$J(\xi)$ is decreasing on $[0, \infty)$, so that $\xi'(0^+) > 0$ as well, we should have $B_L = [a, b]$,
$B_U = \emptyset$, $0 < a < b < \infty$. Before proving this, we show that our monotonicity
assumption implies that $g$ is strongly unimodal.


LEMMA 1. If $J(\xi)$ is strictly decreasing on $[0, \infty)$, and continuously
differentiable on $(0, \infty)$, then $\xi$ is positive and strictly increasing on $(0, \infty)$. The converse
is false.


PROOF. Under the stated conditions, any critical point $x_0$ of $\xi$ must furnish a
local maximum. Thus $\xi(x_0) > 0$, and in order that $\xi$ not become negative on an
unbounded interval there must exist an inflection point $x_1 > x_0$ at which $0 =
\xi''(x_1) < \xi(x_1)\xi'(x_1) < 0$, a contradiction. Thus $\xi$ is monotonic and nonnegative
on $(0, \infty)$. From this observation the result is immediate.

Counterexamples to the converse are furnished by the distributions $G_l$ defined
in the Introduction with $l > 2$. □


If $J(\xi)$ is merely decreasing on $(0, \infty)$, Lemma 1 fails for, say, $g(x) = (1 +
2|x|)\exp(-|x|)/6$. Some distributions satisfying the conditions of Lemma 1 are the
logistic, and those $G_l$ with $1 < l \le 2$.




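THEOREM 2. Under the conditions of Lemma 1, there exists $\varepsilon_0 = \varepsilon_0(G)$ such
that for $\varepsilon \in (0, \varepsilon_0]$, $I(F)$ is minimized over $K_\varepsilon$ by that $F_0$ with

\[ \psi_0(x) = \left\{ \lambda_1 \tan\frac{\lambda_1 x}{2},\ \ \xi(x),\ \ \lambda = \xi(b) \right\}, \]

\[ f_0(x) = \left\{ g(a)\,\frac{\cos^2(\lambda_1 x/2)}{\cos^2(\lambda_1 a/2)},\ \ g(x),\ \ g(b)\exp(-\xi(b)(x - b)) \right\} \]

on $[0, a]$, $[a, b]$, $[b, \infty)$, respectively. The constants $a, b, \lambda_1$ are determined by (i)
$F_0(a) = G(a) - \varepsilon$, (ii) $F_0(\infty) = 1$, and (iii) $\psi_0(a - 0) = \xi(a)$. Thus $B_L = [a, b]$,
$B_U = \emptyset$. Minimum information is

\[ I(F_0) = 2\left\{ \lambda_1^2\left[G(a) - \tfrac{1}{2} - \varepsilon\right] - \xi^2(b)\left[1 - G(b) + \varepsilon\right] + \int_a^b J(\xi)\, dG \right\}. \]

The limiting values are $(\varepsilon, a, b, \lambda_1^2, -\lambda^2) \to (0, 0, \infty, J(\xi)(0), J(\xi)(\infty))$, and $\varepsilon_0(G)$
is defined by $a(\varepsilon_0) = b(\varepsilon_0)$.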


PROOF. It is a straightforward matter (see Wiens (1985) for details) to
establish the existence of constants $a, b, \lambda_1$ satisfying (i)-(iii) and


$\xi(x) \ge \psi_0(x)$, $x \in [0, a]$; $\quad J(\xi)(a) \le \lambda_1^2$.


Integrating this first inequality shows that $f_0 \le g$ on $[0, a]$. The monotonicity of
$\xi$ ensures that $\xi \ge \psi_0$ and $f_0 \ge g$ on $[b, \infty)$, and that $J(\xi)(b) \ge -\lambda^2$. The
conditions of Theorem 1 are then satisfied, as long as $a \le b$. □


We now establish sufficient conditions under which the "large $\varepsilon$" form of the
solution is valid.


THEOREM 3. Suppose that, on $(0, \infty)$, $\xi(x)$ satisfies (A.1) $\xi(x) > 0$, $(x\xi(x))'
> 0$, (A.2) $(\xi(x)/x)' < 0$, and (A.3) $\xi$ has no local minima in $(b_0, \infty)$, where $b_0$ is
defined by $b_0\xi(b_0) = 1$. Then there exists $\varepsilon_1 = \varepsilon_1(G)$ such that for $\varepsilon \in (\varepsilon_1, 1/2]$,
the minimum information $F_0 \in K_\varepsilon$ is described by

\[ f_0(x) = \left\{ g(b)\,\frac{\cos^2(\lambda_1 x/2)}{\cos^2(\lambda_1 b/2)},\ \ g(b)\exp(-\lambda(x - b)) \right\} \]

on $[0, b]$ and $[b, \infty)$, respectively. The constants $\lambda_1, b$ satisfy (i) $F_0(b) =
G(b) - \varepsilon$, (ii) $F_0(\infty) = 1$, and (iii) $\psi_0(b) \le \xi(b)$. Thus $B_L = \{b\}$, $B_U = \emptyset$.
Minimum information is

\[ I(F_0) = 2(\lambda_1^2 + \lambda^2)\left(G(b) - \tfrac{1}{2} - \varepsilon\right) - \lambda^2. \]

The limiting values are $(\varepsilon, b, \lambda_1, \lambda) \to (\tfrac{1}{2}, \infty, 0, 0)$.


PROOF. The identity $(xg(x))' = g(x)(1 - x\xi(x))$, together with (A.1), implies
that $\lim_{x \to \infty} xg(x) = 0$, $\lim_{x \to 0} x\xi(x) < 1$, $\lim_{x \to \infty} x\xi(x) > 1$; hence the existence
of a unique point $b_0$ as in (A.3). It is easily checked that if (i)-(iii) are satisfied,
and if $F_0 \in K_\varepsilon$, then the conditions of Theorem 1 are met.

Similar to the development in Sacks and Ylvisaker (1972), (A.1) implies the
existence of $\varepsilon_0(G) < \tfrac{1}{2}$ such that (i)-(iii) are satisfiable for $\varepsilon \ge \varepsilon_0$. Then (A.2)
ensures that on $[0, b]$, $\xi(x)$ remains above the line segment joining $(0, 0)$ to
$(b, \psi_0(b))$, hence above the convex function $\psi_0$. As in Theorem 2, this implies
that $g \ge f_0$ on $[0, b]$, so that $G \ge F_0 \ge G - \varepsilon$ there. Alternatively, this may be
established under the conditions of Lemma 1. Now (A.3) implies the existence of
$\varepsilon_1(G) \in [\varepsilon_0, \tfrac{1}{2}]$ such that for $\varepsilon \ge \varepsilon_1$, $F_0$ remains within the boundaries of the
Kolmogorov strip on $(b, \infty)$. As in Theorem 2, if (A.3) is replaced by the stronger


(A.3)': $\xi'(x) > 0$ on $(0, \infty)$,


then we may take $\varepsilon_1 = \varepsilon_0$. See Wiens (1985) for the details. □
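The information formula of Theorem 3 can be checked numerically. The following is a minimal sketch, not the author's computation: it takes $G$ to be the Cauchy distribution and fixes $b = 1$, and it assumes that $\psi_0(x) = \lambda_1 \tan(\lambda_1 x/2)$ on $[0, b]$ is continued continuously by the constant $\lambda = \lambda_1 \tan(\lambda_1 b/2)$ on $[b, \infty)$. The function names and the bisection bracket are illustrative choices. It solves condition (ii) for $\lambda_1$, reads off $\varepsilon$ from condition (i), and compares the closed form for $I(F_0)$ with a direct quadrature of $2\int_0^\infty \psi_0^2 f_0\,dx$.

```python
import math

def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))

def cauchy_cdf(x):
    return 0.5 + math.atan(x) / math.pi

def centre_mass(b, lam1):
    """Integral over [0, b] of f_0 = g(b) cos^2(lam1 x/2) / cos^2(lam1 b/2)."""
    num = b / 2.0 + math.sin(lam1 * b) / (2.0 * lam1)
    return cauchy_pdf(b) * num / math.cos(lam1 * b / 2.0) ** 2

def tail_rate(b, lam1):
    """lambda = psi_0(b); assumes psi_0 is continuous at b (an assumption here)."""
    return lam1 * math.tan(lam1 * b / 2.0)

def solve_lam1(b, lo=0.1, hi=1.5, iters=200):
    """Bisection on condition (ii): total mass on [0, inf) equals 1/2.

    The bracket [lo, hi] is hand-picked for b = 1 and is not general.
    """
    def excess(lam1):
        return centre_mass(b, lam1) + cauchy_pdf(b) / tail_rate(b, lam1) - 0.5
    assert excess(lo) > 0.0 > excess(hi), "bracket does not straddle a root"
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def information_by_quadrature(b, lam1, n=2000):
    """2 * integral of psi_0^2 f_0 over [0, inf): Simpson on [0, b], exact tail."""
    lam = tail_rate(b, lam1)
    scale = cauchy_pdf(b) / math.cos(lam1 * b / 2.0) ** 2

    def integrand(x):
        psi = lam1 * math.tan(lam1 * x / 2.0)
        return psi * psi * scale * math.cos(lam1 * x / 2.0) ** 2

    h = b / n
    s = integrand(0.0) + integrand(b)
    s += 4.0 * sum(integrand((2 * k - 1) * h) for k in range(1, n // 2 + 1))
    s += 2.0 * sum(integrand(2 * k * h) for k in range(1, n // 2))
    centre = s * h / 3.0
    tail = lam * cauchy_pdf(b)  # integral of lam^2 g(b) exp(-lam (x - b)) on [b, inf)
    return 2.0 * (centre + tail)

b = 1.0
lam1 = solve_lam1(b)
lam = tail_rate(b, lam1)
eps = cauchy_cdf(b) - 0.5 - centre_mass(b, lam1)   # condition (i)
info_formula = 2.0 * (lam1 ** 2 + lam ** 2) * (cauchy_cdf(b) - 0.5 - eps) - lam ** 2
info_numeric = information_by_quadrature(b, lam1)
print(eps, lam1, lam, 1.0 / info_formula, 1.0 / info_numeric)
```

The two information values agree because, for a continuous $\psi_0$, integration by parts gives $I(F_0) = \int J(\psi_0)\,dF_0$ with $J(\psi) = 2\psi' - \psi^2$, which is the form assumed in this sketch.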


COROLLARY 1. Under the conditions of Lemma 1, the least informative
$F_0 \in K_\varepsilon$ is as described in Theorems 2 and 3, with $\varepsilon_0(G) = \varepsilon_1(G)$.


EXAMPLE 1. Those distributions $G_l$, $1 < l \le 2$, are covered by Corollary 1.
Huber (1964) and Sacks and Ylvisaker (1972) obtained $\varepsilon_0(G_2) = 0.0303$. Working
through the numerical details of Theorem 3 extends the result to the Laplace
distribution ($l = 1$), with $\varepsilon_0(G_1) = 0$.


Theorem 3 applies to those $G_l$ with $l < 1$, and we find $\varepsilon_1(G_{0.5}) = 0.0355$,
$\varepsilon_1(G_{0.2}) = 0.0153$. Although assumption (A.2) of Theorem 3 fails for $G_l$ if $l > 2$, a
slight modification to the proof shows that the conclusions apply to these cases as
well, with $\varepsilon_1 = \varepsilon_0$.


EXAMPLE 2. Denote by $G_r(x)$ the "Student's" t distribution on $r$ d.f., with
$\xi_r(x) = (r + 1)x/(r + x^2)$. Theorem 3 applies, but Theorem 2 does not. The
function $J(\xi_r)(x) = (r + 1)(2r - (r + 3)x^2)/(r + x^2)^2$ attains a positive maximum
at 0, decreases to a negative minimum at $(r(r + 7)/(r + 3))^{1/2} = M_r$, then
increases to 0 at $\infty$. The discussion in Section 2 then suggests that for sufficiently
small $\varepsilon$, say $\varepsilon \le \varepsilon_I(r)$, there should exist points $a, b, c, d$, with $0 < a < b <
M_r < c < d$ such that $F_0$ has $B_L = [a, b]$, $B_U = [c, d]$. More precisely, this
"Stage I" solution is given by

\[ \psi_0(x) = \left\{ \lambda_1 \tan\frac{\lambda_1 x}{2},\ \ \xi(x),\ \ \lambda_2 \tanh\left\{-\frac{\lambda_2}{2}(x - \omega)\right\},\ \ \xi(x),\ \ \lambda = \xi(d) \right\}, \]

\[ f_0(x) = \left\{ g(a)\,\frac{\cos^2(\lambda_1 x/2)}{\cos^2(\lambda_1 a/2)},\ \ g(x),\ \ g(b)\,\frac{\cosh^2\{\tfrac{1}{2}\lambda_2(x - \omega)\}}{\cosh^2\{\tfrac{1}{2}\lambda_2(b - \omega)\}},\ \ g(x),\ \ g(d)\exp(-\xi(d)(x - d)) \right\} \]

on $[0, a]$, $[a, b]$, $[b, c]$, $[c, d]$, $[d, \infty)$, respectively. See Figure 1. The seven
constants are determined by the conditions $F_0(a) = G(a) - \varepsilon$, $F_0(c) = G(c) + \varepsilon$,
$F_0(\infty) = 1$, and continuity of $f_0$ at $c$ and of $\psi_0$ at $a$, $b$, $c$. Given the existence of
such constants, the conditions of Theorem 1 are easily verified.
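The closed form for $J(\xi_r)$ and the location $M_r$ of its minimum can be checked with a few lines of code. This is only an illustrative sketch: it assumes the operator $J(\psi) = 2\psi' - \psi^2$ (the definition is not restated in this section, but it reproduces the closed form quoted in Example 2), and the function names are hypothetical.

```python
import math

def xi(x, r):
    """Score function -g'/g of Student's t on r d.f."""
    return (r + 1.0) * x / (r + x * x)

def J_numeric(x, r, h=1e-5):
    """J(xi) = 2 xi' - xi^2 (assumed definition), with a central difference for xi'."""
    dxi = (xi(x + h, r) - xi(x - h, r)) / (2.0 * h)
    return 2.0 * dxi - xi(x, r) ** 2

def J_closed(x, r):
    """Closed form quoted in Example 2."""
    return (r + 1.0) * (2.0 * r - (r + 3.0) * x * x) / (r + x * x) ** 2

def M(r):
    """Location of the negative minimum of J(xi_r)."""
    return math.sqrt(r * (r + 7.0) / (r + 3.0))

for r in (1, 2, 5):
    worst = max(abs(J_numeric(x, r) - J_closed(x, r)) for x in (0.0, 0.5, 1.0, 2.0, 5.0))
    print(r, M(r), J_closed(0.0, r), J_closed(M(r), r), worst)
```

For the Cauchy case $r = 1$ this gives $M_1 = \sqrt{2}$, which is the limiting value of $b$ and $c$ in the $\varepsilon = 0$ row of Table 1.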


For $r = 1$, some numerical values of the constants are given in Table 1 below
for this, and the three subsequent stages. Stage II differs from I in that $a = b$
and $\psi_0(b) \le \xi(b)$, and is valid for $\varepsilon \in [\varepsilon_I(1), \varepsilon_{II}(1)] = [0.00573, 0.02515]$. Stage
III has as well $c = d$ and $\psi_0(c) \ge \xi(c)$, for $\varepsilon_{II}(1) \le \varepsilon \le 0.0377 = \varepsilon_{III}(1)$. Stage IV
is then as described in Theorem 3, and is obtained by letting $\omega \to \infty$ in Stage III.

Since Theorem 2 becomes applicable at $r = \infty$, it is clear that this sequence of
stages cannot hold for all $r$. Numerical investigations have shown that it is in
fact only valid for $r = 1$. For $r \ge 2$, Stage II is altered by requiring $c = d$, $a < b$,
$\psi_0(c) \ge \xi(c)$. On a range $2 \le r \le R$, Stage III then has $a = b$, $\psi_0(b) \le \xi(b)$. For
$r > R$, it has instead $a < b$, $c = d$, $F_0(c) < G(c) + \varepsilon$. In each of Stages I-III,
$B_U \downarrow \emptyset$ as $r \to \infty$.

Collins and Wiens (1985) obtained the most robust $\psi_0$ for an $\varepsilon$-contamination
neighbourhood $\mathcal{G}_\varepsilon$ of $G_r$, and found it to be of the form exhibited in Figure 1,
without the "tan" and "constant" portions. This reflects the fact that in $K_\varepsilon$,
maxima of $J(\xi)$ may be dampened by removing mass from $g$, whereas in $\mathcal{G}_\varepsilon$ only
minima may be handled, by adding mass to $(1 - \varepsilon)g$.


EXAMPLE 3. If $\xi$ is positive, decreasing, and convex on $(0, \infty)$, then $J(\xi)$ is
negative and increasing there but $J(\xi)(0) = +\infty$. Examples are the distributions
$G_l(x)$, $l < 1$, for which Theorem 3 applies.

In view of the Dirac delta in $J(\xi)$ at 0, we expect that for small values of $\varepsilon$,
$J(\psi_0)$ is constant on three contiguous intervals symmetric about zero, and in
neighbourhoods of $\pm\infty$. As at 3(iii) of Theorem 1, $F_0$ cannot remain on the lower
boundary of the Kolmogorov strip, in $(0, \infty)$. The "small $\varepsilon$" solution should then
be obtained from those same equations defining the Stage II, $r = 1$ solution of
Example 2. It is then easy to see that the remaining two stages must be as for the
Cauchy distribution. For $G_{0.5}$, see Wiens (1985) for some numerical values.


FIG. 1. Most robust $\psi_0$ for a Kolmogorov neighbourhood of the Cauchy distribution, with $\varepsilon = 0.005$
(Stage I). The constants are given in Table 1, and the horizontal axis is $\ln(1 + x)$.

REMARKS. 1. Some feeling for the geometry of a Kolmogorov neighbourhood
is given by the "infinitesimal loss of Fisher information" $(d/d\varepsilon)\, I(F_0)|_{\varepsilon = 0}$. In
general, if $F_0$ is determined by equations (i)-(iii) of Theorem 2, or by (i) and (ii) of
Theorem 3, then

\[ \frac{d}{d\varepsilon} I(F_0) = -2\left(\lambda_1^2(\varepsilon) + \lambda^2(\varepsilon)\right). \]

For $G_l$, $1 < l \le 2$, this varies monotonically from $-\infty$ at $\varepsilon = 0$ to 0 at $\varepsilon = \tfrac{1}{2}$. For
the logistic, it varies from $-4$ to 0.


2. Consider the $L^1$ neighbourhood of $G$, defined by $\mathcal{L}_\varepsilon = \{F \mid \int |f - g|\,dx \le
\varepsilon\}$. If $F_0$ is determined as in Theorem 2, or as in Theorem 3 with (A.3)' holding,




TABLE 1
Least informative F_0 in Kolmogorov neighbourhoods of the Cauchy distribution

Stage  ε        a      b      c      d       λ1     λ2     λ      ω      1/I(F_0)
I      0        0      √2     √2     ∞       2      1.165  0             2
       0.001    0.411  0.800  2.66   159.15  1.81   1.04   0.013  4.09   2.05
       0.005    0.599  0.635  3.55   31.81   1.64   0.94   0.063  4.88   2.22
       0.00573  0.620  0.620  3.66   27.75   1.63   0.93   0.072  4.99   2.26

II     0.006           0.622  3.70   26.50   1.62   0.92   0.075  5.03   2.27
       0.010           0.657  4.26   15.87   1.53   0.87   0.125  5.57   2.45
       0.025           0.765  6.19   6.26    1.29   0.71   0.312  7.54   3.19
       0.02515         0.766  6.22   6.22    1.29   0.71   0.313  7.56   3.20

III    0.026           0.772  6.24           1.28   0.70   0.32   7.67   3.24
       0.030           0.797  6.31           1.23   0.67   0.39   8.31   3.45
       0.035           0.825  6.21           1.17   0.62   0.51   9.93   3.72
       0.0377          0.839  6.08           1.14   0.59   0.59   ∞      3.86

IV     0.0377          0.839                 1.14          0.59          3.86
       0.0535          0.946                 1.02          0.54          4.71
       0.0608          1.00                  0.98          0.51          5.17
       0.1512          1.56                  0.58          0.28          16.74
       0.3366          3.85                  0.16          0.05          484.12
       0.5             ∞                     0             0             ∞
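Two relations implied by the Stage I and Stage II forms in Example 2 give a quick consistency check on the tabulated constants. The sketch below is an illustration under stated assumptions, not part of the original computations: since $f_0 = g$ on $[c, d]$ and $F_0(c) = G(c) + \varepsilon$, we have $F_0(d) = G(d) + \varepsilon$, while the exponential tail carries mass $g(d)/\xi(d)$; for the Cauchy distribution this gives $\varepsilon = \tfrac{1}{2} - \arctan(d)/\pi - 1/(2\pi d)$. The second relation is $\lambda = \xi(d)$. The rows used are a hand-picked subset of Table 1 with $c < d$.

```python
import math

def xi_cauchy(x):
    """Score function -g'/g of the Cauchy density."""
    return 2.0 * x / (1.0 + x * x)

def eps_from_d(d):
    """epsilon implied by F_0(d) = G(d) + eps and tail mass g(d)/xi(d) = 1/(2 pi d)."""
    return 0.5 - math.atan(d) / math.pi - 1.0 / (2.0 * math.pi * d)

# (epsilon, d) pairs taken from Stages I and II of Table 1 (illustrative subset)
eps_d_rows = [(0.001, 159.15), (0.005, 31.81), (0.00573, 27.75), (0.006, 26.50), (0.025, 6.26)]
# (d, lambda) pairs from the same rows, for the relation lambda = xi(d)
d_lambda_rows = [(31.81, 0.063), (27.75, 0.072), (26.50, 0.075), (6.26, 0.312)]

for eps, d in eps_d_rows:
    print(eps, d, eps_from_d(d))
for d, lam in d_lambda_rows:
    print(d, lam, xi_cauchy(d))
```

Agreement is only expected up to the rounding of the tabulated values.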





then F, € Z}. Since the symmetric (hence less informative) subclass LE of Zi 
is contained in K,, F minimizes information over Z, as well. Note that for 
e > 1, inf{I(F)|F € £3.) = 0. 


REFERENCES 


COLLINS, J. R. and WIENS, D. P. (1985). Minimax variance M-estimators in ε-contamination models.
Ann. Statist. 13 1078-1096.

HUBER, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.

HUBER, P. J. (1981). Robust Statistics. Wiley, New York.

NATANSON, I. P. (1960). Theory of Functions of a Real Variable 2. Ungar, New York.

SACKS, J. and YLVISAKER, D. (1972). A note on Huber's robust estimation of a location parameter.
Ann. Math. Statist. 43 1068-1075.

VANDELINDE, V. D. (1979). Robust techniques in communication. In Robustness in Statistics (R.
Launer and G. Wilkinson, eds.) 177-200. Academic, New York.

WIENS, D. P. (1985). Minimax variance M-estimators of location in Kolmogorov neighbourhoods.
Research report DALTRS-85-6, Dalhousie Univ.


DEPARTMENT OF MATHEMATICS, STATISTICS, 
AND COMPUTING SCIENCE 

DALHOUSIE UNIVERSITY 

HALIFAX, Nova Scotia B3H 4H8 

CANADA 


The Annals of Statistics 
1986, Vol 14, No 2, 733-742 


ON OPTIMAL DECISION RULES FOR SIGNS OF PARAMETERS 


By YOSEF HOCHBERG¹ AND MARC E. POSNER


New York University 


The problem of deciding the signs of $k$ parameters $(\theta_1, \ldots, \theta_k) = \theta$ based
on $(\hat\theta_1, \ldots, \hat\theta_k) \sim N(\theta, \Sigma)$ such that $P_\theta\{\text{any error}\} \le \alpha\ \forall\, \theta$ is discussed by
Bohrer and Schervish (1980). They characterize a desirable class of procedures
called locally optimal. For the case $k = 2$, $\Sigma = I$, and $\alpha \le 1/3$, they present a
particular rule from this class called the double cross. In this paper, we
address the problem of selecting a best rule from among all locally optimal
rules when $k = 2$ and $\Sigma = I$. When $\alpha \le 1/3$, the double cross is shown to be an
attractive choice. Other rules are obtained for higher values of $\alpha$. We also
examine a more general optimization criterion than the one used by Bohrer
and Schervish and obtain different optimal rules for several classes of problems.
The optimal rule corresponding to one of these classes has no two-decision
region. A modification of the formulation is offered under which a
well-known rule (with two decision regions) emerges as the unique optimal
procedure.


1. Introduction. A common statistical problem in comparative experiments 
is the simultaneous decision of the signs of several parameters based on normally 
distributed estimators. As examples, the parameters of interest might be mea- 
sures of the effects of several competitive drugs (relative to a control) or measures 
of several side effects of one drug or measures of the carcinogenic potential of 
various materials. 

This problem was first considered by Neyman (1935) who developed goodness 
criteria for decision rules concerning signs. Lehmann (1957) gave a decision 
theoretic formulation and characterized some unbiased and optimal rules. Another 
early work on this subject was by Kaiser (1960). Bohrer and Schervish (1980) and 
Bohrer (1982) provide further results and indicate additional applications for this 
statistical problem. 

To formalize the problem, let $\hat\theta \sim N(\theta, \Sigma)$ where diagonal elements of the
correlation matrix $\Sigma$ are all 1. Based on the vector $\hat\theta$ and the known matrix $\Sigma$, we
want to decide for each component $\theta_i$ of $\theta$ whether it is positive or negative in
such a way that the probability of making at least one incorrect decision is no
larger than a specified value $\alpha$. Bohrer (1979) showed that if $\alpha < 0.5$, the
condition


(1)  $P_\theta\{\text{no incorrect decision}\} \ge 1 - \alpha \quad \forall\, \theta \in R^k$


requires the inclusion of a third decision about each $\theta_i$. This decision must not be
incorrect under any value of $\theta_i$, and is usually interpreted as "no decision" or "no
classification." Generally, it is associated with low values of $|\hat\theta_i|$. By introducing


Received December 1984; revised August 1985. 

¹ On leave from Tel Aviv University.

AMS 1980 subject classification. Primary 62J16.

Key words and phrases. Three decision rule, locally optimal, generalized optimization functions.




the third decision, we are reducing the expected number of classifications to get a
sufficiently high probability of no incorrect decision under all values of $\theta$ (in
particular the ones near 0).

We restrict our discussion to the case of two parameters with i.i.d. estimators.
Some comments will be provided on the problems of extending the results to
more general cases.

For the class of decision rules that satisfy (1), Bohrer and Schervish consider
those procedures that are symmetric and upper convex. Symmetry of the decision
rule can be expressed in terms of the following two implications:


(2)  $D(\hat\theta_1, \hat\theta_2) = (i, j) \Rightarrow D(\hat\theta_2, \hat\theta_1) = (j, i)$,
     $D(\hat\theta_1, \hat\theta_2) = (i, j) \Rightarrow D(\pm\hat\theta_1, \pm\hat\theta_2) = (\pm i, \pm j)$,

where $D(\hat\theta) = (D_1, D_2)$ is the decision vector for the signs of $\theta_1$ and $\theta_2$ and each
$D_i$ takes on one of the values $-1$, 0, or 1 indicating the decision $\theta_i < 0$, no
classification, and $\theta_i > 0$, respectively.

A rule is upper convex if whenever $\hat\theta$ leads to making two classifications
$D = (\pm 1, \pm 1)$, then for all $c_1, c_2 \ge 1$, $(c_1\hat\theta_1, c_2\hat\theta_2)$ leads to the same two decisions.

Bohrer and Schervish (1980) define a locally optimal rule as a symmetric,
upper convex rule satisfying (1) that maximizes the expected number of correct
decisions in the limit as $\theta$ goes to 0. To identify such a rule they let


$x_0 = P\{D = (0, 0) \mid \theta = 0\}$,

$x_1 = P\{D = (0, 1) \mid \theta = 0\}$,

$x_2 = P\{D = (1, 1) \mid \theta = 0\}$.

Under the symmetry conditions of (2), $x_2$ also represents the probability of each
of the three decision profiles $D = (1, -1)$, $D = (-1, 1)$, and $D = (-1, -1)$.
Similarly, $x_1$ represents the probability of each of the decision profiles $D =
(-1, 0)$, $D = (1, 0)$, and $D = (0, -1)$. The restrictions placed on the decision rules
imply the following linear relations:
From (1),

(3)  $x_0 + 2x_1 + x_2 \ge 1 - \alpha$;

from (1), upper convexity, and independence,

(4)  $x_2 \le \alpha^2$;

and from probability theory,

(5)  $x_0 + 4x_1 + 4x_2 = 1$.

As $\theta \to 0$, the limit of the expected number of correct decisions is given by

(6)  $2x_1 + 4x_2$.

The resulting problem is a linear program where (6) is maximized subject to (3),
(4), (5) and $x_i \ge 0\ \forall\, i$. The solution for $\alpha \le 1/3$ is

(7)  $x_0 = 1 - 2\alpha + 2\alpha^2$, $\quad x_1 = \dfrac{\alpha - 3\alpha^2}{2}$, $\quad x_2 = \alpha^2$.

When $1/3 < \alpha \le 3/4$, the solution is

(8)  $x_0 = 1 - \dfrac{4\alpha}{3}$, $\quad x_1 = 0$, $\quad x_2 = \dfrac{\alpha}{3}$,

and for $3/4 < \alpha$,

(9)  $x_0 = 0$, $\quad x_1 = 0$, $\quad x_2 = \dfrac{1}{4}$.


Thus, any symmetric rule satisfying upper convexity and either (7), (8), or (9)
is locally optimal. The case (7) is of main interest since in practice $\alpha$ is usually
smaller than 1/3. For the case when $\alpha \le 1/3$, Bohrer and Schervish (1980) identified
a rule in the class of locally optimal rules called the double cross. In this paper
we set up a criterion for selecting a best procedure from the class of locally
optimal procedures and for (7) end up with a formal justification for the use of
the double cross. Also, we examine the effects of the modification of the
optimization criterion (6) on the class of decision rules.

There are some difficulties in extending our results to the case of dependence.
The main problem is characterizing those procedures that are locally optimal and
then controlling the probability of at least one error under all $\theta$. This difficulty
was already noted for the $k = 2$ dependent case by Bohrer and Schervish (1980).
They showed that the double cross does not control the probability of any error
under all $\theta$ if the conditions for local optimality are satisfied.

Problems also occur for higher dimensions ($k > 2$). In the case of independence,
a "double cross type" rule can be shown to be locally optimal under
natural conditions of symmetry and upper convexity. However, again there is
difficulty in establishing the required control of the probability of any error
under all $\theta$.


2. Why the double cross? When $\alpha \le 1/3$ the double cross depicted in Figure
1 is a locally optimal procedure. It has a (0, 0) region in the shape of a cross
centered at the origin that is imbedded in a larger cross formed by the union of
all one-classification regions and the (0, 0) region. $z_1$ is the $1 - \alpha$ quantile of a
standard normal variable and $z_2 > z_1$ is determined so that the probability of
the shaded region in Figure 1 under $\theta = 0$ is $1 - \alpha$. This procedure decides that $\theta_i$
has the sign of $\hat\theta_i$ if $|\hat\theta_i| > z_2$, or if $|\hat\theta_i| > z_1$ and $|\hat\theta_{3-i}| > z_1$. Bohrer and Schervish
(1980) proved that it controls the probability of any error under all $\theta$. However,
they did not provide further explanation as to why this rule should be singled out
from among all locally optimal rules. For instance, the locally optimal rule that
controls the probability of any error and minimizes the area of the (0, 0) region is
depicted in Figure 2.
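The cutoffs of the double cross can be computed directly. The sketch below is illustrative only: it assumes independent unit-variance estimators, and it fixes $z_2$ by matching the probability $x_1 = (\alpha - 3\alpha^2)/2$ of the (0, 1) region in (7), on the reading that this region is $\{|\hat\theta_1| \le z_1,\ \hat\theta_2 > z_2\}$. The function names `double_cross_cutoffs` and `decide` are hypothetical.

```python
from statistics import NormalDist

def double_cross_cutoffs(alpha):
    """Return (z1, z2) for the double cross; requires 0 < alpha < 1/3."""
    if not 0.0 < alpha < 1.0 / 3.0:
        raise ValueError("alpha must lie in (0, 1/3)")
    nd = NormalDist()
    z1 = nd.inv_cdf(1.0 - alpha)                 # 1 - alpha quantile
    x1 = (alpha - 3.0 * alpha ** 2) / 2.0        # probability of the (0, 1) region, from (7)
    upper_tail = x1 / (1.0 - 2.0 * alpha)        # P(theta_hat_2 > z2) under theta = 0
    z2 = nd.inv_cdf(1.0 - upper_tail)
    return z1, z2

def decide(t1, t2, z1, z2):
    """Decision vector (D1, D2) of the double cross for estimates (t1, t2)."""
    def sign(v):
        return (v > 0) - (v < 0)
    both_moderate = abs(t1) > z1 and abs(t2) > z1
    d1 = sign(t1) if (abs(t1) > z2 or both_moderate) else 0
    d2 = sign(t2) if (abs(t2) > z2 or both_moderate) else 0
    return d1, d2

z1, z2 = double_cross_cutoffs(0.05)
print(z1, z2)
print(decide(3.0, 3.0, z1, z2), decide(0.5, 2.5, z1, z2), decide(1.8, 0.5, z1, z2))
```

With these cutoffs the (0, 0) cross has probability $p(2q - p)$ under $\theta = 0$, where $p = 2\Phi(z_1) - 1$ and $q = 2\Phi(z_2) - 1$, and this equals $x_0 = 1 - 2\alpha + 2\alpha^2$ in (7).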

In the following we provide some justification for choosing the double cross.
Let $L$ be the class of locally optimal procedures. When $\alpha \le 1/3$ the two-decision
region of any procedure in $L$ is the same as the two-decision region given in
Figure 1, while the probability of the (0, 1) region under $\theta = 0$ is $(\alpha - 3\alpha^2)/2$ (see
(7)).

A possible criterion for selecting a procedure from $L$ is to maximize the
expected number of correct decisions at some $\theta \ne 0$. Let $\theta > 0$ indicate a vector $\theta$
with $\theta_i \ge 0$ for all $i$ and $\theta_i > 0$ for at least one $i$. Also, let $T$ = Positive
Orthant $\cap$ {(0, 1) region}.


FIG. 1. The double cross.


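THEOREM 1. The rule in $L$ that maximizes the expected number of correct
classifications at $\theta > 0$ has a region $T$ of the form

(10)  $T^* = \{(\hat\theta_1, \hat\theta_2) \mid e^{\theta_1\hat\theta_2}\cosh(\theta_2\hat\theta_1) + e^{\theta_2\hat\theta_2}\cosh(\theta_1\hat\theta_1)
      \ge c(\alpha, \theta_1, \theta_2),\ 0 \le \hat\theta_1 \le z_1 \le \hat\theta_2\}$,

where $c(\alpha, \theta_1, \theta_2)$ is determined so that $P\{T^* \mid \theta = 0\} = (\alpha - 3\alpha^2)/4$.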


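PROOF. A symmetric procedure in $L$ that maximizes the expected number of
correct decisions at $\theta > 0$ has a (0, 1) region such that the probability of {(0, 1)
region} $\cup$ {(1, 0) region} is maximal at $\theta$. Since the (0, 1) region is symmetric
about $\hat\theta_1 = 0$, if $(\hat\theta_1, \hat\theta_2) \in T$, then $(-\hat\theta_1, \hat\theta_2) \in$ {(0, 1) region}. Further,
$(\hat\theta_2, \hat\theta_1), (\hat\theta_2, -\hat\theta_1) \in$ {(1, 0) region}. To find $T^*$ a generalized Neyman-Pearson
lemma can be used. Accordingly, any point $(\hat\theta_1, \hat\theta_2)$ in the positive orthant for
which the generalized likelihood ratio exceeds a critical constant $d(\alpha, \theta)$, i.e.,

(11)  $\dfrac{f_\theta(\hat\theta_1, \hat\theta_2) + f_\theta(-\hat\theta_1, \hat\theta_2) + f_\theta(\hat\theta_2, \hat\theta_1) + f_\theta(\hat\theta_2, -\hat\theta_1)}{f_0(\hat\theta_1, \hat\theta_2) + f_0(-\hat\theta_1, \hat\theta_2) + f_0(\hat\theta_2, \hat\theta_1) + f_0(\hat\theta_2, -\hat\theta_1)} \ge d(\alpha, \theta) \ge 0$

($f_\theta(\hat\theta_1, \hat\theta_2)$ is the joint normal density of $(\hat\theta_1, \hat\theta_2)$ with mean $\theta$), is in $T^*$. The
numerator and denominator in (11) are the contributions to the likelihood of the
{(0, 1) region} $\cup$ {(1, 0) region} by a point $(\hat\theta_1, \hat\theta_2)$ and the associated symmetric
points $(-\hat\theta_1, \hat\theta_2)$, $(\hat\theta_2, \hat\theta_1)$, $(\hat\theta_2, -\hat\theta_1)$ under $\theta$ and under 0, respectively. Substitution
for $f_\theta(\cdot, \cdot)$ in (11) gives

\[ \frac{\exp\{-\tfrac{1}{2}(\theta_1^2 + \theta_2^2)\}}{4}\left[ e^{\theta_1\hat\theta_2}\left(e^{\theta_2\hat\theta_1} + e^{-\theta_2\hat\theta_1}\right) + e^{\theta_2\hat\theta_2}\left(e^{\theta_1\hat\theta_1} + e^{-\theta_1\hat\theta_1}\right) \right] \ge d(\alpha, \theta). \]

By letting

\[ c(\alpha, \theta_1, \theta_2) = \frac{4\,d(\alpha, \theta)}{2\exp\{-\tfrac{1}{2}(\theta_1^2 + \theta_2^2)\}}, \]

(10) follows. □


FIG. 2. The locally optimal rule that minimizes area of (0, 0) region.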


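COROLLARY 1. As $\theta \downarrow 0$, the region defined by (10) converges to the (0, 1)
region of the double cross.


PROOF. Let $\{h(\hat\theta_1) \mid 0 \le \hat\theta_1 \le z_1\}$ represent the boundary of $T^*$ with the (0, 0)
region. Therefore,

(12)  $e^{\theta_1 h(\hat\theta_1)}\cosh(\theta_2\hat\theta_1) + e^{\theta_2 h(\hat\theta_1)}\cosh(\theta_1\hat\theta_1) = c(\alpha, \theta_1, \theta_2)$ for $0 \le \hat\theta_1 \le z_1$.

Implicitly differentiating (12) with respect to $\hat\theta_1$ and then solving for $h'(\hat\theta_1)$ gives

\[ h'(\hat\theta_1) = \frac{-\theta_2 e^{\theta_1 h(\hat\theta_1)}\sinh(\theta_2\hat\theta_1) - \theta_1 e^{\theta_2 h(\hat\theta_1)}\sinh(\theta_1\hat\theta_1)}{\theta_1 e^{\theta_1 h(\hat\theta_1)}\cosh(\theta_2\hat\theta_1) + \theta_2 e^{\theta_2 h(\hat\theta_1)}\cosh(\theta_1\hat\theta_1)}. \]

When $\theta \to 0^+$, the limit of $h'(\hat\theta_1)$ is 0. This implies that the limiting form of $h(\cdot)$
is $h(\cdot)$ = constant. □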

This corollary implies that for any rule $r \in L$ where $r \ne$ double cross, there
exists an $\varepsilon > 0$ such that


$E_\theta$(number of correct classifications under $r$)
$\le E_\theta$(number of correct classifications under double cross)
for all $\|\theta\| < \varepsilon$.


Note that the double cross is also the rule in $L$ that minimizes the maximal value
of $|\hat\theta_i|$ for which no decision is made on $\theta_i$.

Applying a similar approach to (8) for $1/3 < \alpha \le 3/4$ gives the unique optimal
procedure depicted in Figure 3. For (9), the case when $3/4 < \alpha$, the optimal
procedure is unique without additional restrictions to the problem.


3. Generalized optimization functions. The optimization function (6) used 
by Bohrer and Schervish (1980) is based on the number of correct decisions in the 
various decision regions. Thus, a region with one correct and one undecided 
classification has the same value as a region with one correct and one incorrect 
classification. This is somewhat unappealing as a region with one correct and one 
undecided classification under some circumstances might be more attractive. 
Also, we wonder why the doubly correct classification region should necessarily 
have twice the value of a one correct, one undecided region. More generally, are 
the relative values assigned to the different types of regions in the computation of 
(6) always justified? 

These considerations lead us to postulate a more general objective function
that is an arbitrary weighted combination of the probabilities of the various
regions. Usually, the weight will increase in relation to the level of attractiveness.
Due to symmetry, the probability under $\theta = 0$ for each of the two-classification
regions is identical. Similarly, all one-classification regions have the same
probability. Thus, it can be shown that for any $\alpha \in (0, 1)$, and for any weighted
combination of the probabilities of the regions, we can formulate an optimization
problem. Since $x_0 = 1 - 4x_1 - 4x_2$ (see (5)) and $x_0 \ge 0$, the optimization problem
can be formulated in terms of only $x_1$ and $x_2$ as follows:

(13)  maximize    $a x_1 + b x_2$,
      subject to  $2x_1 + 3x_2 \le \alpha$,
                  $x_2 \le \alpha^2$,
                  $x_1 + x_2 \le \tfrac{1}{4}$,
                  $x_1, x_2 \ge 0$,

where $a$ and $b$ are constants obtained from the weights given to the regions. We
assume that $a, b \ge 0$.


FIG. 3. The optimal procedure for the solution (8).

Note that the Bohrer and Schervish formulation is a special case of (13) with
$a = 2$, $b = 4$. For $a = 0$, $b = 1$, the objective function becomes the (limiting)
probability of making two correct decisions. When $a = 2$, $b = 3$, we are optimizing
the (limiting) probability of making at least one correct decision.

Whenever $a/b < 2/3$, we get locally optimal conditions identical to (7), (8), and
(9). The discussion in Section 2 suggests good choices for the optimal rule.
For $a/b > 2/3$ and $\alpha \le 1/2$, the solution to (13) is $x_1 = \alpha/2$, $x_2 = 0$. By direct
application of the limiting arguments given in the prior section we have the
unique rule that is depicted in Figure 4. For $1 > a/b > 2/3$ and $1/2 < \alpha \le 3/4$, the
solution is $x_1 = 3/4 - \alpha$, $x_2 = \alpha - 1/2$ and the double cross can be used as the
optimal rule. When $3/4 < \alpha$ the solution $x_1 = 0$, $x_2 = 1/4$ yields a unique optimal
rule.


FIG. 4. An optimal rule when $a/b \ge 2/3$ and $\alpha \le 1/2$.
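Because (13) has only two variables, its solution can be found by enumerating the vertices of the feasible polygon. The following is a minimal sketch of that idea, assuming the constraint set $2x_1 + 3x_2 \le \alpha$, $x_2 \le \alpha^2$, $x_1 + x_2 \le 1/4$, $x_1, x_2 \ge 0$. The function name `solve_lp13` is hypothetical, and ties between vertices (as at $a/b = 2/3$) are not resolved in any principled way.

```python
from itertools import combinations

def solve_lp13(a, b, alpha, tol=1e-9):
    """Maximize a*x1 + b*x2 over the feasible region of (13) by vertex enumeration."""
    # Each constraint is (p, q, r), meaning p*x1 + q*x2 <= r.
    cons = [(2.0, 3.0, alpha), (0.0, 1.0, alpha ** 2), (1.0, 1.0, 0.25),
            (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    best = None
    for (p1, q1, r1), (p2, q2, r2) in combinations(cons, 2):
        det = p1 * q2 - p2 * q1
        if abs(det) < 1e-12:
            continue  # parallel boundary lines, no intersection point
        x1 = (r1 * q2 - r2 * q1) / det   # Cramer's rule
        x2 = (p1 * r2 - p2 * r1) / det
        if all(p * x1 + q * x2 <= r + tol for p, q, r in cons):
            value = a * x1 + b * x2
            if best is None or value > best[0] + 1e-12:
                best = (value, x1, x2)
    return best[1], best[2]

# Bohrer and Schervish objective (a = 2, b = 4) in the three alpha ranges
print(solve_lp13(2, 4, 0.05), solve_lp13(2, 4, 0.5), solve_lp13(2, 4, 0.9))
# A case with 2/3 < a/b < 1
print(solve_lp13(3, 4, 0.3), solve_lp13(3, 4, 0.6), solve_lp13(3, 4, 0.9))
```

With $a = 2$, $b = 4$ the enumeration returns the $(x_1, x_2)$ parts of (7), (8) and (9). With $a = 3$, $b = 4$ it returns $(\alpha/2, 0)$, $(3/4 - \alpha, \alpha - 1/2)$ and $(0, 1/4)$ in the three ranges of $\alpha$.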

Procedures with no two-decision region (such as the one shown in Figure 4)
seem unappealing. It is hard to accept a rule that does not classify $\theta_i$ for large
values of $\hat\theta_i$. Nevertheless, such procedures are optimal for the given set of
constraints and objective function. One approach to circumvent this situation is
by adding some restrictions to $L$. Notice that upper convexity is only defined for
the two-decision region. It seems logical to extend the upper convexity requirement
to the one-decision region. This implies that if $\hat\theta$ leads to $D_i = \pm 1$, then for
all $c_1, c_2 \ge 1$, $(c_1\hat\theta_1, c_2\hat\theta_2)$ leads to the same $D_i$. Under this generalized upper
convexity requirement we have the following result.


THEOREM 2. There is a unique symmetric and generalized upper convex rule
that maximizes (13) for $1 > a/b > 2/3$ and $\alpha \le 1/2$, which satisfies (1). For this
rule, the (0, 0) region is a square $|\hat\theta_i| < K$ and the two-decision regions are of the
form $\{(\hat\theta_1, \hat\theta_2) \mid |\hat\theta_i| > K,\ i = 1, 2\}$ where $K$ is the $(1 - \alpha)^{1/2}$ percentile of a
standard normal variate.


PROOF. The {(0, 0) region} $\cap$ {(1, 1) region} $\ne \{\}$. Otherwise, there exists a
point $(\hat\theta_1, \hat\theta_2) \in$ {(0, 1) region} such that for some $c > 1$, $(c\hat\theta_1, \hat\theta_2) \in$ {(1, 0) region}.
This would violate generalized upper convexity for $D_2$. Suppose that the intersection
of the (0, 0) and (1, 1) decision regions consists of more than 1 point. Since
$x_1 + x_2 = \tfrac{1}{4}$ is not a binding constraint, there exists an $\varepsilon > 0$ such that $x_2$ can be
reduced by $\varepsilon$, $x_1$ increased by $\tfrac{3}{2}\varepsilon$, and the solution remains feasible. As this new
solution has a higher value, the original one is not optimal for (13). Consequently,
every optimal rule must have the (0, 0) and (1, 1) regions intersect at precisely one
point.

To satisfy upper convexity, the (1, 1) decision region must contain the set
$S = \{(\hat\theta_1, \hat\theta_2) \mid \min\{\hat\theta_1, \hat\theta_2\} \ge K\}$ for some $K > 0$. If $S$ is a proper subset of the
(1, 1) region, $x_2$ can be reduced and $x_1$ increased as above. Thus, in an optimal
rule the (1, 1) decision region is of the form $\{(\hat\theta_1, \hat\theta_2) \mid \min\{\hat\theta_1, \hat\theta_2\} \ge K > 0\}$.

Suppose the (0, 0) region is not a square. By rearranging the areas (keeping $x_1$
and $x_2$ constant), we can find an equivalent rule where the (0, 0) and (1, 1) regions
intersect at more than one point. By a prior argument, $x_1$ and $x_2$ are not optimal
for (13). As a result, the (0, 0) region is square.

Now, for some $K > 0$, the optimal solution can be written as $x_1(K) =
[\Phi(K) - \Phi(-K)][1 - \Phi(K)]$, $x_2 = [1 - \Phi(K)]^2$. Let $y = \Phi(K)$. The objective
function of (13) becomes $\psi(y) = a(2y - 1)(1 - y) + b(1 - y)^2$. Since $0 \le K <
\infty$, $0.50 \le y < 1$. $\psi'(y) < 0$ for $y \in (0.50, 1)$ if $a/b < 1$. Hence, $\psi(y)$ is monotone
decreasing and attains the maximum value on the boundary of the constraint set.
Since the constraints $2x_1 + 3x_2 \le \alpha$ and $x_2 \le \alpha^2$ are equivalent to $y \ge (1 - \alpha)^{1/2}$
and $y \ge 1 - \alpha$, respectively, the optimal solution has $\Phi(K) = (1 - \alpha)^{1/2}$. This
completes the proof. □


FIG. 5. The Sp cross.
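The last step of the proof can be traced numerically. The sketch below is illustrative only (the function names are hypothetical): it computes $K$ from $\Phi(K) = (1 - \alpha)^{1/2}$, recovers $x_1$ and $x_2$ for the square (0, 0) region, and evaluates the objective $\psi(y)$.

```python
from statistics import NormalDist

def sp_cross_K(alpha):
    """K with Phi(K) = (1 - alpha)^(1/2), as in Theorem 2."""
    return NormalDist().inv_cdf((1.0 - alpha) ** 0.5)

def region_probs(K):
    """(x1, x2) under theta = 0 for a square (0, 0) region of half-width K."""
    y = NormalDist().cdf(K)
    return (2.0 * y - 1.0) * (1.0 - y), (1.0 - y) ** 2

def objective(y, a, b):
    """psi(y) = a(2y - 1)(1 - y) + b(1 - y)^2."""
    return a * (2.0 * y - 1.0) * (1.0 - y) + b * (1.0 - y) ** 2

alpha = 0.05
K = sp_cross_K(alpha)
x1, x2 = region_probs(K)
print(K, x1, x2, 2.0 * x1 + 3.0 * x2)
print([objective(y, 3.0, 4.0) for y in (0.5, 0.6, 0.7, 0.8, 0.9)])
```

Since $2x_1 + 3x_2 = 1 - y^2$, the first constraint of (13) is binding at this $K$, while $x_2 \le \alpha^2$ holds with slack.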


The optimal rule according to Theorem 2 is depicted in Figure 5. This rule is a
well recognized procedure that is discussed by Bohrer (1979) and Bohrer and
Schervish (1980). It has been named the Sp cross and was originally discussed by
Spjøtvoll (1972) for a slightly different problem.

We have omitted the case of $a/b = 2/3$ from our discussion. When $\alpha \le 1/2$, the
optimal solution to this case will not be unique. Thus, examination of additional
criteria will be required to fix $x_1$ and $x_2$. We leave this to future research.


REFERENCES 


BOHRER, R. (1979). Multiple three-decision rules for parametric signs. J. Amer. Statist. Assoc. 74
432-437.

BOHRER, R. (1982). Optimal multiple decision problems: some principles and procedures applicable in
cancer drug screening. In Probability Models and Cancer (L. Le Cam and J. Neyman,
eds.) 287-301. North-Holland, Amsterdam.

BOHRER, R. and SCHERVISH, M. (1980). An optimal multiple decision rule about signs. Proc. Nat.
Acad. Sci. U.S.A. 77 52-56.

KAISER, H. (1960). Directional statistical decisions. Psych. Rev. 67 160-167.

LEHMANN, E. L. (1957). A theory of some multiple decision problems. Ann. Math. Statist. 28 1-25.

NEYMAN, J. (with cooperation of IWASZKIEWICZ and KOLODZIEJCZYK) (1935). Statistical problems in
agricultural experimentation. J. Roy. Statist. Soc. (Suppl.) 2 107-154.

SPJØTVOLL, E. (1972). On the optimality of some multiple comparison procedures. Ann. Math.
Statist. 43 398-411.


GRADUATE SCHOOL OF BUSINESS ADMINISTRATION 
NEW YORK UNIVERSITY
NEW YORK, NEW YORK 10006


The Annals of Statistics
1986, Vol. 14, No. 2, 743-752


ORTHOGONALITY OF FACTORIAL EFFECTS 


By CHAND K. CHAUHAN AND A. M. DEAN 
Indiana- Purdue University and Ohio State University 


A necessary and sufficient condition is given for a specified factorial effect 
to be orthogonal to every other factorial effect, after adjustment is made for 
blocks. The results are extended to the case of regular disconnected designs. 
The structure of a generalized inverse of the intrablock matrix is investigated 
when certain pairs of factorial spaces are orthogonal. A useful class of designs 
exhibiting partial orthogonal factorial structure is identified and examples are 
given. 


1. Introduction. When a factorial experiment is arranged as an incomplete 
block design, some degree of nonorthogonality is necessarily introduced into the 
analysis. For ease of interpretation, therefore, it is frequently desirable to use a 
design which admits an orthogonal analysis of the main effects and interactions 
(after adjusting for block effects). Such designs are said to have “orthogonal 
factorial structure.” A set of sufficient conditions for a design to have orthogonal 
factorial structure was given by Cotter, John, and Smith (1973), but there exist 
many designs which are orthogonal yet do not satisfy these conditions. Mukerjee 
(1979) gave a set of necessary and sufficient conditions for orthogonal factorial 
structure which are useful for constructing classes of such designs [see Mukerjee 
(1981)]. As pointed out by John and Smith (1972) and Mukerjee (1979), several of 
the well-known classes of designs (such as group divisible, generalized cyclic) 
exhibit orthogonal factorial structure for particular sets of factor levels. It is 
frequently the case, however, that in factorial experimentation high-order inter- 
actions are assumed to be negligible and, in such cases, a design with complete 
orthogonal factorial structure is unnecessary. All that is required is a design 
which admits an orthogonal partition of contrasts belonging to the low-order 
factorial effects. Mukerjee (1980) considered this problem and adapted his 1979 
conditions to give a set of necessary and sufficient conditions for the orthogonal- 
ity of all interaction effects of order less than or equal to a fixed number, ¢. 

Unfortunately, the conditions of Mukerjee (1979, 1980) give no information on 
the orthogonality of any specified pair of factorial spaces. Thus if the conditions 
are violated, it is not known which of the factorial spaces are orthogonal and 
which are nonorthogonal (see Example 1). The purpose of this paper is to show 
that part of this information is, in fact, available in the course of checking 
Mukerjee’s conditions (see Section 3), and to show that useful designs exhibiting 
partial orthogonal factorial structure are readily available (see Section 4). The 
results are extended to regular disconnected designs in Section 5. In addition, in 
Section 3 the structure of a generalized inverse of the intrablock matrix is 
investigated when certain pairs of factorial spaces are orthogonal.


Received February 1984; revised September 1985. 
AMS 1980 subject classifications. Primary 62K15; secondary 62K10, 15A09.
Key words and phrases. Factorial experiments, orthogonal factorial structure, incomplete block 
designs. 


744 C. K. CHAUHAN AND A. M. DEAN 


2. Notation and preliminaries. Consider a factorial experiment with $p$
factors $F_1, F_2, \ldots, F_p$ where the $j$th factor has $m_j$ levels and
$v = \prod_{j=1}^{p} m_j$. The $v$ treatment combinations are written as
$p$-tuples, $a = (a_1, a_2, \ldots, a_p)$ where $0 \le a_j \le m_j - 1$,
$j = 1, \ldots, p$. We use the convention of writing the treatment
combinations in lexicographical order (i.e., ascending numerical order when
viewed as $p$-digit numbers). We represent a generalized interaction by
$\alpha^x$, where $x = (x_1, x_2, \ldots, x_p)$ and $x_j = 1$ if factor $F_j$
is present in the interaction, and $x_j = 0$ otherwise. For brevity, we use
the term "interaction" to mean "main effect or interaction." Let $\Phi^*$ be
the lexicographically ordered set of all binary vectors
$x = (x_1, \ldots, x_p)$ and denote the $i$th element of $\Phi^*$ by $\phi_i$,
$i = 1, \ldots, 2^p$. Then $\alpha^{\phi_1}$ denotes the general mean and is
nonestimable in any incomplete block design. Let
$\Phi = \{\phi_2, \phi_3, \ldots, \phi_n\}$ where $n = 2^p$. For
$x \ne y \in \Phi$, $\alpha^x$ and $\alpha^y$ represent different generalized
interactions and hence define different factorial spaces. The factorial space
corresponding to $\alpha^x$ can be represented by a vector space $V_x$ of
dimension $\prod_{j=1}^{p} (m_j - 1)^{x_j}$. A set of basis vectors for $V_x$
is given by a set of orthogonal contrasts in the treatment parameters
corresponding to the interaction $\alpha^x$. We shall be interested in the
independence of the estimators of contrasts in the treatment parameters,
having adjusted for block effects, where the contrasts belong to different
vector spaces $V_x$ and $V_y$, $x \ne y \in \Phi$.
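The notation above can be made concrete with a short sketch. It lists the treatment combinations in lexicographical order, enumerates the nonzero binary vectors in $\Phi$, and evaluates the dimension formula $\prod_j (m_j - 1)^{x_j}$. The function names and the choice of a 4 × 2 × 2 experiment are illustrative assumptions, not part of the paper.

```python
from itertools import product
from math import prod


def treatment_combinations(levels):
    """All p-tuples (a_1, ..., a_p) with 0 <= a_j <= m_j - 1, in lexicographical order."""
    return list(product(*(range(m) for m in levels)))


def interactions(p):
    """Phi: the nonzero binary vectors of length p, in lexicographical order."""
    return [x for x in product((0, 1), repeat=p) if any(x)]


def dimension(levels, x):
    """dim V_x = prod_j (m_j - 1)^{x_j}."""
    return prod((m - 1) ** xj for m, xj in zip(levels, x))


levels = (4, 2, 2)  # illustrative choice of factor levels
combos = treatment_combinations(levels)
dims = {x: dimension(levels, x) for x in interactions(len(levels))}
for x, d in dims.items():
    print("".join(map(str, x)), d)
```

The dimensions of the factorial spaces add up to $v - 1$, the number of independent treatment contrasts, which is a quick sanity check on the formula.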

Let the factorial experiment be arranged in $b$ blocks where the $j$th block is of
size $k_j < v$, and the $i$th treatment combination is observed a total of $r_i$ times in
the design ($i = 1, \ldots, v$; $j = 1, \ldots, b$).

The usual intrablock model will be assumed, namely 


$$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} \qquad (i = 1, \ldots, v;\ j = 1, \ldots, b),$$


where $y_{ij}$ is the yield of the plot in the $j$th block which received the $i$th
treatment combination, $\tau_i$ is the effect of the $i$th treatment combination, $\beta_j$ is
the effect of the $j$th block, $\mu$ is a constant, and the $\varepsilon_{ij}$ are independent normal
random variables with zero means and homogeneous variances $\sigma^2$.

The reduced normal equations for estimating the treatment effects having 
adjusted for blocks are 

$$A\hat{\tau} = Q,$$

where

$$Q = T - N k^{-\delta} B, \qquad A = r^{\delta} - N k^{-\delta} N', \tag{2.1}$$

and where $T$ and $B$ are vectors of treatment totals and block totals respectively,
$N$ is the incidence matrix for the design, $r^{\delta}$ and $k^{\delta}$ are diagonal matrices of
treatment replications and block sizes respectively, $\hat{\tau}$ is the intrablock estimator
of $\tau = (\tau_1, \tau_2, \ldots, \tau_v)'$, and $'$ denotes transpose.

A solution to the normal equations is given by 

$$\hat{\tau} = \Omega Q,$$

where $\Omega$ is any generalized inverse of $A$, that is $A \Omega A = A$.
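As a hedged illustration of (2.1), the sketch below builds the intrablock matrix $A = r^{\delta} - N k^{-\delta} N'$ directly from a list of blocks, using exact rational arithmetic. The helper name `intrablock_matrix` and the three-treatment toy design are assumptions made for the example only.

```python
from fractions import Fraction


def intrablock_matrix(blocks, v):
    """A = r^delta - N k^{-delta} N' for blocks given as lists of treatment indices."""
    r = [0] * v
    a = [[Fraction(0)] * v for _ in range(v)]
    for block in blocks:
        k = len(block)  # block size k_j
        for s in block:
            r[s] += 1  # replication count r_s
            for t in block:
                a[s][t] -= Fraction(1, k)  # contribution to N k^{-delta} N'
    for s in range(v):
        a[s][s] += r[s]
    return a


# Toy design (hypothetical): three treatments in three blocks of size two.
blocks = [[0, 1], [0, 2], [1, 2]]
A = intrablock_matrix(blocks, 3)
for row in A:
    print([str(entry) for entry in row])
```

Every row of $A$ sums to zero, which is why the general mean is not estimable and why rank($A$) is at most $v - 1$.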

Apart from Section 5 where disconnected designs are considered, it will be
assumed that rank($A$) $= v - 1$, so that all contrasts in the treatment parameters
are estimable. Let $C^x$ be a $q \times v$ matrix where $q \ge \prod_j (m_j - 1)^{x_j}$, such that the
rows of $C^x$ form a set of contrasts spanning the vector space $V_x$, $x \in \Phi$.
Symbolically we may write $\alpha^x = C^x \tau$. It was shown by Kurkjian and Zelen (1962)


ORTHOGONALITY OF FACTORIAL EFFECTS 745 


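that one such set of contrasts is given by $C^x = v^{-1} M^x$, where

$$M^x = M_1^{x_1} \otimes M_2^{x_2} \otimes \cdots \otimes M_p^{x_p}, \tag{2.2}$$

$\otimes$ denotes the Kronecker product of matrices, and

$$M_j^{x_j} = \begin{cases} m_j I_{m_j} - J_{m_j} & \text{if } x_j = 1, \\ m_j^{1/2} e_{m_j}' & \text{if } x_j = 0, \end{cases}$$

where $e_{m_j}$ is a column vector of $m_j$ elements each equal to $m_j^{-1/2}$,
$J_{m_j} = m_j e_{m_j} e_{m_j}'$, and $I_{m_j}$ is an $m_j \times m_j$ identity matrix.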
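Any other set of contrasts spanning $V_x$ may be written as

$$C^x = C_1^{x_1} \otimes C_2^{x_2} \otimes \cdots \otimes C_p^{x_p}, \tag{2.3}$$

where $C_j^{x_j} = R_j^{x_j} M_j^{x_j}$ with $R_j^{x_j} = m_j$ for $x_j = 0$, and for $x_j = 1$, $R_j^{x_j}$ is any $s_j \times m_j$
matrix of rank $(m_j - 1)$ where $s_j \ge m_j - 1$, $j = 1, \ldots, p$. Thus $C^x$ may be
expressed as $C^x = R^x M^x$ where $R^x = R_1^{x_1} \otimes R_2^{x_2} \otimes \cdots \otimes R_p^{x_p}$.

Without loss of generality, we select $C^x$ so that $C^x C^{x\prime} = I$. Hence the rows of
$C^x$ form an orthonormal basis for $V_x$, and $s_j = (m_j - 1)$ when $x_j = 1$ for
$j = 1, \ldots, p$. Note that for $x \ne y \in \Phi^*$, $C^x C^{y\prime} = 0$.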

The covariance between the minimum variance unbiased estimators of the
parametric functions $C^x\tau$ and $C^y\tau$, after adjusting for block effects, is given by
$\operatorname{Cov}(C^x\hat{\tau}, C^y\hat{\tau}) = C^x \Omega C^{y\prime} \sigma^2$, for any generalized inverse, $\Omega$, of $A$. A design has
orthogonal factorial structure if and only if $C^x \Omega C^{y\prime} = 0$ for all $x \ne y \in \Phi$ [see,
for example, Cotter, John, and Smith (1973)].


3. Orthogonality of factorial spaces. In Theorem 1 a necessary and sufficient
condition is given for a specified interaction, $\alpha^x$, to be orthogonal to all
other interactions, $\alpha^y$. Thus it follows that, in the course of checking the
conditions given by Mukerjee [(1979), Theorem 3.3 and (1980), Theorem 2.2] for
orthogonal factorial structure, information is in fact obtainable on the pairwise
orthogonality of factorial spaces.

Let $C^x$, $V_x$, $A$, and $\Phi = [\phi_2, \ldots, \phi_n]$, $n = 2^p$, be defined as in Section 2. The
proof of Theorem 1 requires the following lemma.


LEMMA 1. Let $W$ and $Q$ be two real symmetric $v \times v$ matrices, and let $Q^+$ be
the Moore-Penrose inverse of $Q$. If $QW = WQ$, then (i) $QQ^+W = WQQ^+$ and
(ii) $Q^+W = WQ^+$.


PROOF. Using the properties of the Moore-Penrose inverse, and the fact that
$QW = WQ$,

(i) $QQ^+W = Q^+QW = Q^+WQ = Q^+WQQ^+Q = Q^+QWQQ^+ = QQ^+QWQ^+ = QWQ^+ = WQQ^+$.

(ii) $WQ^+ = WQ^+QQ^+ = Q^+QWQ^+$, using (i),
$= Q^+WQQ^+ = Q^+QQ^+W$, using (i),
$= Q^+W$. □




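THEOREM 1. A necessary and sufficient condition for $\alpha^x$, $x = \phi_i \in \Phi$, to be
orthogonal to $\alpha^y$, for all $y = \phi_j \in \Phi$, $j \ne i$, is that $C^{x\prime}C^x$ commutes with $A$.


PROOF. (i) Necessity. Assume that $\alpha^x$ is orthogonal to all $\alpha^y$, $x = \phi_i$, $y = \phi_j$,
$\phi_i \ne \phi_j \in \Phi$, and let $H^x = [C^{\phi_2\prime}, \ldots, C^{\phi_{i-1}\prime}, C^{\phi_{i+1}\prime}, \ldots, C^{\phi_n\prime}]'$. Then, following the
arguments in the proof of Theorem 3.1 of Mukerjee (1979),

$$A = C^{x\prime}C^x A C^{x\prime}C^x + H^{x\prime}H^x A H^{x\prime}H^x. \tag{3.1}$$

Since $C^x C^{x\prime} = I$ and $C^x H^{x\prime} = 0$, it follows from (3.1) that $C^{x\prime}C^x$ and $A$
commute.

(ii) Sufficiency. Assume that $C^{x\prime}C^x$ and $A$ commute; then from Lemma 1,
$C^{x\prime}C^x$ and $A^+$ commute, hence

$$C^x A^+ H^{x\prime} = C^x C^{x\prime} C^x A^+ H^{x\prime} = C^x A^+ C^{x\prime} C^x H^{x\prime} = 0.$$

Therefore, setting $\Omega = A^+$, the orthogonality of $\alpha^x$ and all $\alpha^y$, $y \ne x \in \Phi$,
follows. □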


Note that for all contrast matrices $C^x$ as defined in (2.3), $C^{x\prime}C^x = b M^{x\prime}M^x$ for
some constant $b$, where $M^x$ is defined in (2.2) [see Dean (1978), Lemma 1].
Therefore $C^{x\prime}C^x$ commutes with $A$ if and only if $M^{x\prime}M^x$ commutes with $A$.
Hence if the condition of Theorem 1 is satisfied for all $x \in \Phi$ of order $\le t$, then
Theorem 1 implies Theorem 2.2 of Mukerjee (1980), and if $t = p$, then Theorem
3.3 of Mukerjee (1979) follows.

Note also that, since $A$ is symmetric, the condition of Theorem 1 is equivalent
to the condition of symmetry of $C^{x\prime}C^xA$, and hence that of $M^{x\prime}M^xA$.


EXAMPLE 1. Consider a design for a 4 × 2 × 2 experiment in 16 blocks of size
6 obtained by adding in turn the treatment combinations (000, 100, 200, 300) to
the following four blocks


000 001 100 101 210 311
001 010 101 110 211 300
010 011 110 111 200 301
011 000 111 100 201 310


where addition of two treatment combinations $a_1a_2a_3$ and $b_1b_2b_3$ is defined by
$c_1c_2c_3 = a_1a_2a_3 + b_1b_2b_3$, where $c_i = a_i + b_i \bmod m_i$, $i = 1, 2, 3$.


The concurrence matrix $NN'$ is block circulant of the form $\{Q_1\ Q_2\ Q_3\ Q_4\}$
where each $Q_i$ is a 4 × 4 circulant matrix; $Q_1 = \{6,2,0,2\}$; $Q_2 = \{2,4,2,1\}$;
$Q_3 = \{0,2,4,2\}$; $Q_4 = \{2,1,2,4\}$. It may be verified that $M^{x\prime}M^xNN'$ (and hence
$M^{x\prime}M^xA$, since the design is proper and equireplicate), is symmetric for all $x$
except for $x = (110)$ and $x = (111)$. Hence all pairs of interactions are orthogonal
with the possible exception of $(\alpha^{110}, \alpha^{111})$.

The condition of Mukerjee [(1980), Theorem 2.2] does not hold for this 
example, and therefore could not be used to deduce the orthogonality of all pairs 
of effects of order less than or equal to 2. 
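The symmetry check described in Example 1 can be carried out numerically. The sketch below rebuilds the 16 blocks from the four initial blocks, forms $NN'$, and tests whether $G^x NN'$ is symmetric, where $G^x$ is the Kronecker product of $m_jI - J$ (factor present) and $J$ (factor absent). It assumes that $G^x$ is proportional to $M^{x\prime}M^x$, which is all the symmetry test requires; the helper names are hypothetical and the code uses integer arithmetic only.

```python
from itertools import product

LEVELS = (4, 2, 2)
INITIAL_BLOCKS = [
    ["000", "001", "100", "101", "210", "311"],
    ["001", "010", "101", "110", "211", "300"],
    ["010", "011", "110", "111", "200", "301"],
    ["011", "000", "111", "100", "201", "310"],
]


def index(t):
    """Position of a treatment combination in lexicographical order."""
    i = 0
    for a, m in zip(t, LEVELS):
        i = i * m + a
    return i


def concurrence_matrix():
    """NN' for the blocks obtained by adding 000, 100, 200, 300 to the initial blocks."""
    nn = [[0] * 16 for _ in range(16)]
    for shift in range(LEVELS[0]):
        for blk in INITIAL_BLOCKS:
            ids = [index(((int(s[0]) + shift) % LEVELS[0], int(s[1]), int(s[2])))
                   for s in blk]
            for s in ids:
                for t in ids:
                    nn[s][t] += 1
    return nn


def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def factor(m, present):
    """m*I - J if the factor is in the interaction, J otherwise."""
    if present:
        return [[m - 1 if i == j else -1 for j in range(m)] for i in range(m)]
    return [[1] * m for _ in range(m)]


def gram(x):
    """Matrix proportional to M^x' M^x, built factor by factor."""
    g = [[1]]
    for m, xj in zip(LEVELS, x):
        g = kron(g, factor(m, xj))
    return g


def is_symmetric(a):
    return all(a[i][j] == a[j][i] for i in range(len(a)) for j in range(i))


NN = concurrence_matrix()
failing = [x for x in product((0, 1), repeat=3)
           if any(x) and not is_symmetric(matmul(gram(x), NN))]
print(failing)  # interactions for which the condition of Theorem 1 fails
```

Because the design is proper and equireplicate, symmetry of the product with $NN'$ is equivalent to symmetry of the product with $A$, so the list `failing` identifies the interactions that are not orthogonal to every other interaction.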




COROLLARY 1. If $\alpha^x$ is orthogonal to all $\alpha^y$, $y \ne x \in \Phi$, then the intrablock
matrix, $A$, can be expressed as $A = A_1 + A_2$ where the rows and columns of $A_1$
belong to $V_x$, and the rows and columns of $A_2$ are orthogonal to $V_x$.


PROOF. Follows directly from (3.1). □


Corollary 1 relates the orthogonal factorial properties of the design directly to 
the structure of the intrablock matrix. Theorem 2 shows that any generalized 
inverse of the intrablock matrix exhibits a similar structure when two factorial 
spaces are orthogonal even if the conditions of Theorem 1 are not satisfied. 


LEMMA 2. Let $P$, $X$, and $Q$ be real nonzero matrices such that the product
$PXQ$ exists; then $PXQ = 0$ if and only if $X = X_1 + X_2$ where the columns of $X_1$
are orthogonal to the rows of $P$, and the rows of $X_2$ are orthogonal to the
columns of $Q$.


PROOF. Follows from Rao and Mitra (1971), Theorem 2.3.2 and the proof of
Theorem 2.4.1b. □


THEOREM 2. $\operatorname{Cov}(C^x\hat{\tau}, C^y\hat{\tau}) = 0$ for a specified pair of interactions $\alpha^x$ and
$\alpha^y$, $x, y \in \Phi$, $x \ne y$, if and only if any generalized inverse, $\Omega$, of the intrablock
matrix can be expressed as $\Omega = \Omega_1^x + \Omega_2^y$, where the columns of $\Omega_1^x$ are
orthogonal to $V_x$ and the rows of $\Omega_2^y$ are orthogonal to $V_y$.


PROOF. Follows directly from Lemma 2. □


COROLLARY 2. If $\alpha^y$ represents a main effect (of the first factor without loss
of generality), then $\operatorname{Cov}(C^x\hat{\tau}, C^y\hat{\tau}) = 0$ for any specified $\alpha^x$, $x \ne y \in \Phi$, if and
only if $C^x\Omega P^y$ has constant rows, where $P^y = I_{m_1} \otimes e_s$ and $s = v/m_1$.


PROOF. (i) Sufficiency. $C^y = K^y(I_{m_1} \otimes e_s') = K^yP^{y\prime}$, for some $(m_1 - 1) \times m_1$
orthonormal contrast matrix $K^y$. Hence if $C^x\Omega P^y$ has constant rows then
$C^x\Omega C^{y\prime} = (C^x\Omega P^y)K^{y\prime} = 0$.

(ii) Necessity. If $C^x\Omega C^{y\prime} = 0$ then, from Theorem 2, $C^x\Omega = C^x\Omega_2^y$, where the
rows of $C^x\Omega_2^y$ are orthogonal to $V_y$. Hence $C^x\Omega_2^y = e_{m_1}' \otimes B$, where $B$ is some
$q \times s$ matrix and $q = \prod_j (m_j - 1)^{x_j}$. Hence $C^x\Omega P^y$ is of the form $e_{m_1}' \otimes Be_s =
e_{m_1}' \otimes b$ (where $b$ is a vector of length $q$) as required. □


COROLLARY 3. (a) If $\alpha^y$ represents a first-order interaction (between the first
two factors without loss of generality), then $\operatorname{Cov}(C^x\hat{\tau}, C^y\hat{\tau}) = 0$ for any specified
$\alpha^x$, $x \ne y \in \Phi$, if and only if $C^x\Omega P^y$ can be expressed as $[D_1, D_2, \ldots, D_{m_1}]$,
where $D_l$ is of dimension $q \times m_2$, $q = \prod_j (m_j - 1)^{x_j}$, and $D_l - D_1$ has constant
rows, $l = 1, \ldots, m_1$, and where $P^y = I_{m_1} \otimes I_{m_2} \otimes e_t$, $t = v/(m_1m_2)$.

(b) If (a) holds and $D_1 = D_2 = \cdots = D_{m_1}$ then $\alpha^x$ is also orthogonal to the
main effect of the first factor.

(c) If (a) holds and $D_l$ has constant rows, $l = 1, \ldots, m_1$, then $\alpha^x$ is also
orthogonal to the main effect of the second factor.




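PROOF. (a) (i) Sufficiency. $C^y = K^yP^{y\prime}$ where $K^y = K_1 \otimes K_2$ and $K_i$ is an
$(m_i - 1) \times m_i$ orthonormal contrast matrix, $i = 1, 2$. Also, by assumption,

$$C^x\Omega P^y = (e_{m_1}' \otimes D_1) + [0, D_2 - D_1, \ldots, D_{m_1} - D_1] = (e_{m_1}' \otimes D_1) + (Q \otimes e_{m_2}'). \tag{3.2}$$

Hence $C^x\Omega C^{y\prime} = (C^x\Omega P^y)(K_1 \otimes K_2)' = 0$.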

(ii) Necessity. If C*QC” = 0 then, from Theorem 2, C*Q = CQ} where the 
rows of C*QZ are orthogonal to V,. Hence C*03 = (en, ® B, 8 B,) + (B, ® 
em, ® B,) for some matrices B,, Bo, B,, B, of dimensions q X mz, 
qx m,,qXt,q xt respectively, q =[I(m, — 1)”, t= v/m,m,. Hence 
C*QP” = (en, 8 B, 8 hy) + (B, 8 ef, @ h,) where h, and h; are vectors of 
length q. Let B, = [b,, bz... bm] and let D;= (B, 8 h,) + (6,8 ef, 8 ho), 
f= 1,..., m,, and the result follows. 

(b) Assume that (a) holds, and $D_l = D_1$ for all $l = 2, \ldots, m_1$; then from (3.2),
$C^x\Omega P^y = e_{m_1}' \otimes D_1$. Let $\alpha^z$ represent the main effect of the first factor; then
$P^z = P^y(I_{m_1} \otimes e_{m_2})$. Hence $C^x\Omega P^z = e_{m_1}' \otimes D_1e_{m_2}$, which has constant rows.
Hence from Corollary 2, $\alpha^x$ and $\alpha^z$ are orthogonal.

(c) Assume that (a) holds and $D_l = d_le_{m_2}'$ for some constant $d_l$, $l = 1, \ldots, m_1$.
Hence from (3.2)

$$C^x\Omega P^y = d_1(e_{m_1}' \otimes e_{m_2}') + [0, d_2 - d_1, \ldots, d_{m_1} - d_1] \otimes e_{m_2}'.$$

Let $\alpha^z$ represent the main effect of the second factor; then $P^z = P^y(e_{m_1} \otimes I_{m_2})$.
Hence $C^x\Omega P^z = [d_1 + \gamma]e_{m_2}'$ where $\gamma = \sum_l (d_l - d_1)$, which has constant rows.
Hence from Corollary 2, replacing the first main effect by the second, $\alpha^x$ and $\alpha^z$
are orthogonal. □


EXAMPLE 2. Consider the design of Example 1. We have already shown,
using Theorem 1, that all pairs of main effects are orthogonal for a 4 × 2 × 2
experiment. However, this is a convenient example to illustrate the use of
Corollaries 2 and 3. To check the orthogonality of $\alpha^{010}$ and $\alpha^{100}$ using Corollary 2,
we calculate $C^x\Omega P^y$ where $x = 010$ and $y = 100$. Without loss of generality,
choosing $C^x = R^xM^x$ gives
apr- (0. 0 0 A 
ue ° 0 0 0 
as required. 
To check the orthogonality of $\alpha^{010}$ and $\alpha^{110}$ using Corollary 3,


$$C^x\Omega P^y = R^x\left(e_4' \otimes (m_2I_2 - J_2) \otimes e_2'\right)\Omega\,(I_4 \otimes I_2 \otimes e_2)$$

1-1 1 -l 1 —1 ra] 
1 1 —li1 1 -l 1 —l1 1 
= [D,, D, D3, D,], 


where $D_l$ is 2 × 2 and $D_l - D_1$ has constant rows, $l = 1, \ldots, 4$ as required. Since
$D_1 = D_2 = D_3 = D_4$, Corollary 3(b) verifies the orthogonality of $\alpha^{010}$ and $\alpha^{100}$.




Classes of designs exist in which certain pairs of factorial spaces are or- 
thogonal. One such example is given in Section 4. More generally the characteri- 
zation of designs with certain orthogonality properties can be based on Theorem 
2 or Corollaries 2 and 3. However, this is a problem for further research and will 
not be pursued in this paper. 


4. Designs with partial orthogonal factorial structure. For a given design,
if there exists an $x \in \Phi$ for which $C^{x\prime}C^x$ and $A$ do not commute, then from
Theorem 1, there exists at least one $y \ne x \in \Phi$ such that $\alpha^x$ and $\alpha^y$ are
nonorthogonal. Such interactions, $\alpha^y$, can be identified by calculating the
covariances $C^x\Omega C^{y\prime}$ for all $y \ne x$.

Designs which exhibit orthogonality between many, but not all, pairs of 
factorial spaces will be said to have partial orthogonal factorial structure. In this 
section a useful class of designs possessing such a structure is identified and an 
example is given. 


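DEFINITION. Let $D(m_1, m_2, \ldots, m_p)$ denote the class of designs such that if
$d \in D(m_1, m_2, \ldots, m_p)$ then the Moore-Penrose inverse, $A_d^+$, of the intrablock
matrix of $d$ can be written in the form

$$A_d^+ = \sum_{j=1}^{w} \xi_j\left(R_{1j}^{m_1} \otimes \cdots \otimes R_{rj}^{m_r} \otimes Q_j \otimes R_{sj}^{m_s} \otimes \cdots \otimes R_{pj}^{m_p}\right), \tag{4.1}$$

where $w$ is some integer, $\xi_1, \ldots, \xi_w$ some constants, $R_{ij}^{m_i}$ is an $m_i \times m_i$
permutation matrix, $r < s - 1$, and $Q_j$ is some square matrix of dimension
$m_{r+1}m_{r+2} \cdots m_{s-1}$.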

THEOREM 3. If $d \in D(m_1, m_2, \ldots, m_p)$ then the design $d$ is guaranteed to
have partial orthogonal factorial structure for a $p$-factor experiment whose $i$th
factor has $m_i$ levels, $i = 1, \ldots, p$.


PROOF. If $d \in D(m_1, m_2, \ldots, m_p)$ then $A_d^+$ is given by (4.1). Let $g = \{r + 1,
r + 2, \ldots, s - 1\}$. Consider the generalized interactions $\alpha^x$ and $\alpha^y$, where
$x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$ such that there exists at least one $i \notin g$ for
which $x_i = 1$ and $y_i = 0$. Then with $\Omega = A_d^+$, using (4.1) and (2.3),
$C^x\Omega C^{y\prime} = C^xA_d^+C^{y\prime} = 0$. Hence, $\alpha^x$ and $\alpha^y$ are orthogonal. If $x_i = y_i$ for all $i \le r$ and $i \ge s$,
then the orthogonality of $\alpha^x$ and $\alpha^y$ depends upon the structure of $Q_j$,
$j = 1, \ldots, w$. Hence the design is guaranteed to have partial orthogonal factorial
structure. □


Examples of designs in the class $D(m_1, m_2, \ldots, m_p)$ can be derived from
designs in the class $D_0(n_1, n_2, \ldots, n_q)$ which have orthogonal factorial structure
for a $q$-factor experiment where the $i$th factor has $n_i$ levels, $i = 1, \ldots, q$ and
where $\prod n_i = \prod m_i$, by factorizing and/or combining factor levels.


EXAMPLE 3. The generalized cyclic design, $d_0$, in 16 blocks of size 6 with
generating block (00, 01, 10, 11, 22, 33) has orthogonal factorial structure for an
experiment with two factors each at four levels. Hence $d_0 \in D_0(4,4)$. It can be
verified that the Moore-Penrose inverse of the intrablock matrix is of the form

$$A_{d_0}^+ = \sum_{j=1}^{w} \xi_j\left(R_{1j}^4 \otimes R_{2j}^4\right),$$

where $R_{ij}^4$ is a 4 × 4 circulant permutation matrix, $i = 1, 2$ [see John and Smith
(1972)].

Consider the design $d \in D(4,2,2)$, formed from $d_0$ by mapping the levels of
the second factor to the levels of two new factors each at two levels, respectively.
If the mapping gives a lexicographical ordering of the treatment combinations
then

$$A_d^+ = A_{d_0}^+ = \sum_{j=1}^{w} \xi_j\left(R_{1j}^4 \otimes R_{2j}^4\right) = \sum_{j} \xi_j\left(R_{1j}^4 \otimes R_{2j}^2 \otimes Q_{3j}^2\right)$$

since any 4 × 4 circulant permutation matrix $R_{2j}^4$ can be expressed as $\sum (R_{2j}^2 \otimes
Q_{3j}^2)$ where $R_{2j}^2$ is a 2 × 2 circulant permutation matrix, and $Q_{3j}^2$ is a 2 × 2 matrix
of 0's and 1's. The design $d$ is the design considered in Example 1. It follows from
Theorem 3 that for this design all the pairs of interactions $(\alpha^{x_1x_2x_3}, \alpha^{y_1y_2y_3})$ are
orthogonal for $x_1x_2 \ne y_1y_2$. The orthogonality of the remaining pairs of
interactions $(\alpha^{100}, \alpha^{101})$; $(\alpha^{010}, \alpha^{011})$; $(\alpha^{110}, \alpha^{111})$ may be checked using Theorem 1 or by
calculating the covariances directly. In Example 1, Theorem 1 was applied to
show orthogonality of the first two pairs. It is possible for this design to deduce
that $(\alpha^{110}, \alpha^{111})$ cannot be orthogonal, since otherwise in Example 1, $M^{x\prime}M^xNN'$
would have been symmetric for $x = 110$ and $x = 111$. If $\alpha^{111}$ can be assumed to
be negligible the orthogonality of this pair of interactions is of little importance.
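The decomposition used in Example 3, in which a 4 × 4 circulant permutation matrix is written as a sum of Kronecker products of a 2 × 2 circulant permutation matrix with a 2 × 2 matrix of 0's and 1's, can be checked with a minimal sketch. It relies on the block form $[[A, B], [B, A]]$ of a 4 × 4 circulant; the function names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def circulant_permutation(n, shift):
    """n x n permutation matrix with a 1 in position (i, (i + shift) mod n)."""
    return [[1 if (i + shift) % n == j else 0 for j in range(n)] for i in range(n)]


def decompose(c):
    """Split a 4 x 4 circulant [[A, B], [B, A]] into the terms I (x) A and S (x) B."""
    a = [row[:2] for row in c[:2]]
    b = [row[2:] for row in c[:2]]
    identity = circulant_permutation(2, 0)
    swap = circulant_permutation(2, 1)
    return [(identity, a), (swap, b)]


for shift in range(4):
    c = circulant_permutation(4, shift)
    (r0, q0), (r1, q1) = decompose(c)
    rebuilt = add(kron(r0, q0), kron(r1, q1))
    print(shift, rebuilt == c)
```

Each term has a 2 × 2 circulant permutation matrix in the first position and a 0/1 matrix in the second, which is the shape required for the second and third factors of a design in $D(4,2,2)$.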

Note that designs in the classes D(2,2,4), D(2,4,2), D(2, 2,2,2), D(2,8), and 
D(8,2) may be obtained by similar methods and Theorems 1 and 3 applied in 
each case. 


5. Disconnected designs. Let $A$, $\Phi$, $C^x$, and $V_x$ be defined as in Section 2,
and let $V$ be the vector space spanned by the rows of $A$. If rank($A$) $< v - 1$ then
the design is disconnected. Let $V_x^* = V_x \cap V$; then $V_x^*$ is the vector space of all
estimable contrasts corresponding to $\alpha^x$.


DEFINITION (Mukerjee, 1979). A disconnected incomplete block design is
regular if $\oplus V_x^* = V$, where $\oplus$ represents direct sum over all $x \in \Phi$.

In irregular designs estimable contrasts belonging to factorial effects do not
span the space of all estimable treatment contrasts. For regular designs results
corresponding to Theorems 1, 2 and Corollaries 1, 2, and 3 hold.


Let $B^x$ be a matrix whose rows form an orthonormal basis for $V_x^*$.


THEOREM 4. If $A$ is the intrablock matrix of a regular disconnected incomplete
block design, then the estimable contrasts corresponding to $\alpha^x$, $x = \phi_i \in \Phi$,
are orthogonal to all other estimable contrasts if and only if $B^{x\prime}B^x$ commutes
with $A$.


PROOF. Follows exactly the lines of the proof of Theorem 1, using Mukerjee
(1979), Theorem 4.1. □


Theorem 2 can be extended to regular disconnected designs as follows. $V_x^*$ and
$V_y^*$ are assumed to be nonnull, otherwise the theorem is trivially true.

THEOREM 5. $\operatorname{Cov}(B^x\hat{\tau}, B^y\hat{\tau}) = 0$ for a specified pair of interactions $\alpha^x$ and
$\alpha^y$, $x, y \in \Phi$, $x \ne y$, if and only if $\Omega$ can be expressed as $\Omega_1^x + \Omega_2^y$, where the
columns of $\Omega_1^x$ are orthogonal to $V_x^*$ and the rows of $\Omega_2^y$ are orthogonal to $V_y^*$
(where $V_x^*$ and $V_y^*$ are nonnull).


PROOF. Follows directly from Lemma 2. □


Corollaries 1, 2, and 3 can be extended in the obvious way by replacing $C^x$
with $B^x$, $V_x$ with $V_x^*$, and redefining $H^x$.


EXAMPLE 4. The generalized cyclic design, $d_0$, in 18 blocks of size 8, with
generating block (00 11 15 20 33 42 44 53) has orthogonal factorial structure
for an experiment with two factors each at 6 levels. Hence $d_0 \in D_0(6,6)$. The
design $d_0$ is disconnected, the confounded contrast belonging to $\alpha^{11}$. Consider the
design $d \in D(2,3,6)$ formed by a lexicographical mapping of the levels of the first
factor to the levels of two new factors. Then

$$A_d^+ = A_{d_0}^+ = \sum_{j=1}^{w} \xi_j\left(R_{1j}^6 \otimes R_{2j}^6\right) = \sum_{j} \xi_j\left(R_{1j}^2 \otimes Q_{2j}^3 \otimes R_{3j}^6\right).$$

From Theorem 3 all pairs of interactions are orthogonal with the possible
exception of $(\alpha^{001}, \alpha^{011})$, $(\alpha^{100}, \alpha^{110})$, and $(\alpha^{101}, \alpha^{111})$. Checking Theorem 4 shows
that $B^{x\prime}B^xA$ is symmetric for $x = 001$ and $100$, but not for $x = 101$ nor $111$.
Hence the first two pairs of interactions are orthogonal but not the last pair.
Alternatively Corollaries 2 and 3 can be checked. The confounded contrast lies in
$V_{101} \oplus V_{111}$.


Acknowledgments. The authors would like to thank the associate editor 
and the referees whose considered and constructive suggestions have considerably 
improved the presentation of this paper. 


REFERENCES 


COTTER, S. C., JOHN, J. A. and SMITH, T. M. F. (1973). Multifactor experiments in non-orthogonal
designs. J. Roy. Statist. Soc. Ser. B 35 361-367.

DEAN, A. M. (1978). The analysis of interactions in single replicate generalized cyclic designs. J. Roy.
Statist. Soc. Ser. B 40 79-84.

JOHN, J. A. and SMITH, T. M. F. (1972). Two factor experiments in non-orthogonal designs. J. Roy.
Statist. Soc. Ser. B 34 401-409.

KURKJIAN, B. and ZELEN, M. (1962). A calculus for factorial arrangements. Ann. Math. Statist. 33
600-619.

MUKERJEE, R. (1979). Inter-effect orthogonality in factorial experiments. Calcutta Statist. Assoc.
Bull. 28 83-108.

MUKERJEE, R. (1980). Further results on the analysis of factorial experiments. Calcutta Statist.
Assoc. Bull. 29 1-26.

MUKERJEE, R. (1981). Construction of effect-wise orthogonal factorial designs. J. Statist. Plann.
Inference 6 391-398.

RAO, C. R. and MITRA, S. K. (1971). Generalized Inverse of Matrices and its Applications. Wiley,
New York.


DEPARTMENT OF MATHEMATICS
INDIANA-PURDUE UNIVERSITY
FORT WAYNE, INDIANA 46815

DEPARTMENT OF STATISTICS
THE OHIO STATE UNIVERSITY
COLUMBUS, OHIO 43210


The Annals of Statistics 
1986, Vol 14, No 2, 753-758 


AN EFRON-STEIN INEQUALITY FOR NONSYMMETRIC 
STATISTICS¹


By J. MICHAEL STEELE 
Princeton University 


If $S(x_1, x_2, \ldots, x_n)$ is any function of $n$ variables and if $X_i, \hat{X}_i$, $1 \le i \le n$,
are $2n$ i.i.d. random variables then

$$\operatorname{var} S \le \tfrac{1}{2} E \sum_{i=1}^{n} (S - S_i)^2,$$

where $S = S(X_1, X_2, \ldots, X_n)$ and $S_i$ is given by replacing the $i$th observation
with $\hat{X}_i$, so $S_i = S(X_1, X_2, \ldots, \hat{X}_i, \ldots, X_n)$. This is applied to sharpen known
variance bounds in the long common subsequence problem.


1. Introduction. In Efron and Stein (1981) the following result was established:
If $S(x_1, x_2, \ldots, x_{n-1})$ is a symmetric function of $n - 1$ variables and
$X_1, X_2, \ldots, X_n$ are independent, identically distributed random variables then for
$S_i = S(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ and $\bar{S} = n^{-1}\sum_{i=1}^{n} S_i$ one has

$$\operatorname{var} S(X_1, X_2, \ldots, X_{n-1}) \le E \sum_{i=1}^{n} (S_i - \bar{S})^2. \tag{1.1}$$

This inequality was motivated by a desire to understand the nature of the bias 
in the jackknife estimate of variance, but it has also proved useful in the 
probabilistic analysis of algorithms, Steele (1981, 1982b). There have been exten- 
sions of the Efron-Stein inequality to the case where one drops out more than 
one observation from S (Bhargava (1980)), and there have been new proofs of the 
result by Karlin and Rinott (1982) and Vitale (1984). 

The purpose of this note is to establish an analogue to (1.1) which is not 
burdened by a symmetry hypothesis. It will be proved using the Hilbert space 
technique introduced in this context by Vitale (1984). 

Finally the inequality is applied to a problem of string comparisons by means 
of long common subsequences, a problem considered at length in Sankoff and 
Kruskal (1983). The best known bound on the variance of the longest common 
subsequence is improved, and a new $k$-string comparison problem is introduced.


2. Main results. Let $S(x_1, x_2, \ldots, x_n)$ be any function of $n$ arguments and
consider the statistics formed by

$$S = S(X_1, X_2, \ldots, X_n)$$

and $S_i = S(X_1, X_2, \ldots, X_{i-1}, \hat{X}_i, X_{i+1}, \ldots, X_n)$ where the $X_i$ and $\hat{X}_i$ are $2n$


Received June 1985; revised July 1985. 

¹ Research partially supported by NSF Grant DMS-84-14069.

AMS 1980 subject classifications. Primary 60E15; secondary 62H20. 

Key words and phrases. Efron-Stein inequality, variance bounds, tensor product basis, long 
common subsequences. 






independent random variables with the distribution $F$. In other words, the $S_i$ are
formed by redrawing the $i$th datum independently and then recalculating $S$. We
will prove the following inequality:

$$\operatorname{var} S \le \tfrac{1}{2} E \sum_{i=1}^{n} (S - S_i)^2. \tag{2.1}$$

First we can check that there is no loss of generality in assuming that
$ES^2 < \infty$. To do this, consider new variables which resample the first $i$
observations, so $\hat{S}_i = S(\hat{X}_1, \ldots, \hat{X}_i, X_{i+1}, \ldots, X_n)$ where $X_i$ and $\hat{X}_i$, $1 \le i \le n$,
are $2n$ i.i.d. random variables. Setting $\hat{S}_0 = S$, one has by Schwarz's inequality
that

$$E(\hat{S}_0 - \hat{S}_n)^2 = E\left(\sum_{i=0}^{n-1} (\hat{S}_i - \hat{S}_{i+1})\right)^2 \le n \sum_{i=0}^{n-1} E(\hat{S}_i - \hat{S}_{i+1})^2. \tag{2.2}$$

Since $\hat{S}_i - \hat{S}_{i+1}$ has the same distribution as $S - S_{i+1}$, we see from (2.2) that
the right-hand side of (2.1) is infinite unless $\operatorname{var} S < \infty$. This shows (2.1) holds
when $ES^2 = \infty$ and thus lets us focus on the case $ES^2 < \infty$.

By elementary Hilbert space theory we know that there are functions $\phi_k$ such
that $\phi_0(x) = 1$ and $E\phi_j(X)\phi_k(X) = \delta_{jk}$, i.e., we choose $\phi_k$, $0 \le k < \infty$, to be an
orthonormal basis for $L^2(dF)$. Further, if we let $\mathbf{k} = (k_1, k_2, \ldots, k_n)$ denote a
multiindex then the variables defined by

$$\phi_{\mathbf{k}}(X) = \phi_{\mathbf{k}}(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} \phi_{k_i}(X_i)$$

are orthonormal and there are constants $c(\mathbf{k})$ such that

$$S(X_1, X_2, \ldots, X_n) = \sum_{\mathbf{k}} c(\mathbf{k})\phi_{\mathbf{k}}(X) \tag{2.3}$$

holds almost everywhere. Here we have just expressed $S$ in what is sometimes
called the tensor product basis for $L^2(dF\,dF \cdots dF)$. By orthonormality we
have $ES^2 = \sum_{\mathbf{k}} c^2(\mathbf{k})$ and $\operatorname{var} S = \sum_{\mathbf{k} \ne 0} c^2(\mathbf{k})$. All we need now is to relate these
identities to the right-hand side of (2.2).

Without loss of generality we can assume that $ES = 0$ so $c(\mathbf{k}) = 0$ if $\mathbf{k} =
(0, 0, \ldots, 0)$. We first note that $E(S - S_i)^2 = 2ES^2 - 2ESS_i$. When $\hat{X}_i$ is
substituted into (2.3) to give an expansion for $S_i$ we see that

$$E(SS_i) = \sum_{\mathbf{k}:\, k_i = 0} c^2(\mathbf{k}), \tag{2.4}$$

since the orthonormality of the $\phi_k(X_i)$ and the independence of
$X_1, X_2, \ldots, X_i, \ldots, X_n$ and $\hat{X}_i$ cause all other summands to have expectation
zero. Summing over $i$ we have

$$E \sum_{i=1}^{n} (S - S_i)^2 = 2nES^2 - 2\sum_{\mathbf{k}} c^2(\mathbf{k})z(\mathbf{k}), \tag{2.5}$$

where $z(\mathbf{k}) = \sum_{i=1}^{n} I(k_i = 0)$, i.e., $z(\mathbf{k})$ is equal to the number of indices $k_i$ of the
multiindex $\mathbf{k}$ which equal zero. Since we were able to assume that $c(0) = 0$, we
have $z(\mathbf{k}) \le n - 1$ for all $c(\mathbf{k}) \ne 0$. This is a crucial observation, which applied to
(2.5) gives us

$$E \sum_{i=1}^{n} (S - S_i)^2 \ge 2nES^2 - 2(n - 1)\sum_{\mathbf{k}} c^2(\mathbf{k}) = 2ES^2, \tag{2.6}$$

which completes the proof of the main result.
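Inequality (2.1) can be checked exactly on a small example by enumerating every outcome. The sketch below computes both sides for i.i.d. inputs on a finite set, using rational arithmetic. The two-point distribution, the statistic `nonsymmetric` and the helper names are hypothetical choices made only for illustration; the sum of the observations is included because direct calculation shows that it attains equality.

```python
from fractions import Fraction
from itertools import product


def efron_stein_sides(S, n, dist):
    """Return (var S, (1/2) * sum_i E (S - S_i)^2) exactly for i.i.d. inputs with law dist."""
    values = list(dist)
    mean = Fraction(0)
    second = Fraction(0)
    rhs = Fraction(0)
    for xs in product(values, repeat=n):
        p = Fraction(1)
        for x in xs:
            p *= dist[x]
        s = S(xs)
        mean += p * s
        second += p * s * s
        for i in range(n):
            for new in values:  # redraw the i-th observation independently
                redrawn = xs[:i] + (new,) + xs[i + 1:]
                rhs += p * dist[new] * (s - S(redrawn)) ** 2
    return second - mean ** 2, rhs / 2


def nonsymmetric(x):
    """A statistic that is not symmetric in its arguments."""
    return x[0] * x[1] + 2 * x[2]


def total(x):
    return sum(x)


dist = {0: Fraction(1, 3), 1: Fraction(2, 3)}  # hypothetical distribution F
var_S, bound_S = efron_stein_sides(nonsymmetric, 3, dist)
var_T, bound_T = efron_stein_sides(total, 3, dist)
print(var_S, bound_S)
print(var_T, bound_T)
```

For the sum, each redraw contributes $E(X_i - \hat{X}_i)^2 = 2\sigma^2$, so the right-hand side equals $n\sigma^2$, which is exactly the variance.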

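One can easily extend this result to the situation where one redraws $m$
observations, i.e., one considers $S(X_1, \ldots, \hat{X}_{i_1}, \ldots, \hat{X}_{i_m}, \ldots, X_n)$. We let $[m]$ denote a
subset of $\{1, 2, \ldots, n\}$ of cardinality $m$ and denote by $S_{[m]}$ the statistic obtained
by replacing $X_i$ by $\hat{X}_i$ for $i \in [m]$. The extension we seek for (2.1) is the
following:

$$\binom{n-1}{m-1} \operatorname{var} S \le \tfrac{1}{2} \sum_{[m]} E(S - S_{[m]})^2. \tag{2.7}$$

The observations that we can assume $ES^2 < \infty$, $ES = 0$, and $S =
\sum_{\mathbf{k}} c(\mathbf{k})\phi_{\mathbf{k}}(X)$ go on just as before. Now in calculating $ESS_{[m]}$ we get
$\sum_{\mathbf{k} \in G[m]} c^2(\mathbf{k})$ where $G[m]$ is the set of all $\mathbf{k}$ such that $k_i = 0$ if $i \in [m]$. Hence
using $E(S - S_{[m]})^2 = 2ES^2 - 2ESS_{[m]}$ and summing over all $m$-subsets $[m]$
contained in $\{1, 2, \ldots, n\}$, we get

$$\sum_{[m]} E(S - S_{[m]})^2 = 2\binom{n}{m}ES^2 - 2\sum_{\mathbf{k}} c^2(\mathbf{k})\binom{z(\mathbf{k})}{m} \ge 2\left\{\binom{n}{m} - \binom{n-1}{m}\right\}\sum_{\mathbf{k}} c^2(\mathbf{k}) = 2\binom{n-1}{m-1}ES^2.$$

This completes the proof of inequality (2.7).
Before attending to applications it is worth recording two key remarks.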
Before attending to applications it is worth recording two key remarks. 


(1) The arguments $x_i$ of $S$ need not be real numbers, or for that matter even
vectors, although the case of vectors is by far the most important. The proofs
of (2.1) and (2.7) depend on the structure of the arguments $x_i$ only in the very
shallow sense that we need $L^2(dF)$ to have a countable basis.

(2) The inequalities (2.1) and (2.7) are both sharp as one can see from choosing
$S(x_1, x_2, \ldots, x_n) = x_1 + x_2 + \cdots + x_n$.


3. Applications to long common subsequences. A benefit of the two 
variance bounds of the last section is that they provide reasonably tight bounds 
on the variances of statistics which may be computationally difficult or even 
intractable. One illustration comes from the theory of string comparison. 

The length of the longest common subsequence of two random strings is a
statistic which arises in an amazing variety of fields from biology to computer
science (see, e.g., Sankoff and Kruskal (1983)). Formally, we consider $2n$ independent,
identically distributed random variables $X_i$ and $X_i'$, $1 \le i \le n$, which take
values in a finite alphabet $\mathcal{A}$, and let

$$L_n = \max\{k : X_{i_1} = X_{j_1}', X_{i_2} = X_{j_2}', \ldots, X_{i_k} = X_{j_k}', \text{ where } 1 \le i_1 < i_2 < \cdots < i_k \le n \text{ and } 1 \le j_1 < j_2 < \cdots < j_k \le n\}.$$

Chvátal and Sankoff (1974) initiated the asymptotic study of $EL_n$, proved that
$EL_n \sim cn$, and established some bounds on $c$. Much subsequent work has been
done on $c$ by Deken (1979) and Gordon and Arratia (unpublished).

The most intriguing special case is that of coin-flip sequences; that is where $X_i$
and $X_i'$ are independent random variables with success probability $\tfrac{1}{2}$. In that case
the value $c = 2/(1 + \sqrt{2})$ is consistent with all of the known bounds and
computational experience. This tidy expression was put forth by Richard Arratia
and subsequently we found the following suggestive heuristic.

By a "good $k$ pair" we will denote any pair of subsequences of length $k$ of the
$X_i$ and the $X_i'$ which coincide. We let $Z$ denote the total number of good $k$ pairs
which can be found in the two $n$ strings of the $X_i$ and $X_i'$, $1 \le i \le n$. The
expectation of $Z$ is easily determined,

$$E(Z) = \binom{n}{k}^2 2^{-k},$$

and, by considering ratios of successive choices of $k$, it is easy to see that $E(Z)$ is
unimodal and the mode occurs for that integer value of $k$ nearest $n/(1 + \sqrt{2})$.
We can also get a handle on the number of good $k$ pairs by noting that every $k$
subsequence of the longest common subsequence gives rise to a good $k$ pair, so
there are at least $\binom{L_n}{k}$ of them. Now by the usual unimodality of the binomial
coefficients, as $k$ varies this sequence is unimodal with mode equal to the nearest
integer to $L_n/2$. The heuristic leap of faith is that in expectation these two modes
are within a distance of $o(n)$ of each other. A proof of that leap would prove that
Arratia's suggested value for $c$ is the correct one.
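The location of the mode in the heuristic can be checked numerically. The sketch below finds the $k$ maximizing $\binom{n}{k}^2 2^{-k}$, the expected number of good $k$ pairs for fair coin flips, and compares it with $n/(1 + \sqrt{2})$. To stay in exact integer arithmetic it maximizes the rescaled quantity $\binom{n}{k}^2 2^{n-k}$, which has the same maximizer; the function names and the choice $n = 1000$ are illustrative assumptions.

```python
from math import comb, sqrt


def scaled_expected_good_pairs(n, k):
    """2^n * E(Z) = C(n, k)^2 * 2^(n - k), an exact integer."""
    return comb(n, k) ** 2 * 2 ** (n - k)


def mode(n):
    """The k in 0..n that maximizes the expected number of good k pairs."""
    return max(range(n + 1), key=lambda k: scaled_expected_good_pairs(n, k))


n = 1000
k_star = mode(n)
print(k_star, n / (1 + sqrt(2)))
```

The ratio of successive terms is $(n-k)^2 / (2(k+1)^2)$, which crosses 1 when $(n-k)/k$ is close to $\sqrt{2}$, giving the stated location of the mode.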

A second interesting problem concerning $L_n$ is the conjecture of Chvátal and
Sankoff (1974) that $\operatorname{var}(L_n) = o(n^{2/3})$. It was put forth in Steele (1982a) that
var( L,,) < (n? + 1)*. As an illustration of the power of (2.1), we can now give 
an easy proof of the stronger result, 


$$\operatorname{var} L_n \le n; \tag{3.1}$$

in fact we can show

$$\operatorname{var} L_n \le n\left(1 - \sum_{a \in \mathcal{A}} p_a^2\right), \tag{3.2}$$

where $p_a = P(X_i = a) = P(X_i' = a)$, for $a \in \mathcal{A}$. Since $\sum_{a \in \mathcal{A}} p_a = 1$, we have
$\sum_{a \in \mathcal{A}} p_a^2 \ge |\mathcal{A}|^{-1}$. To get (3.2) from (2.1) we consider $S = L_n(X_1, X_2, \ldots,
X_n, X_1', X_2', \ldots, X_n')$ and consider $S$ as a (nonsymmetric) function of $2n$ variables.
Changing any one of those variables will change $S$ by at most one. Moreover if,
say, $X_i$ is replaced by $\hat{X}_i$ then $P(X_i = \hat{X}_i) = \sum p_a^2$ so $P(S = S_i) \ge \sum p_a^2$. These
two facts give us $E(S - S_i)^2 \le 1 - \sum p_a^2$ and there are $2n$ such summands, so we
have established (3.2). This is a long way from the conjectured $\operatorname{var} L_n = o(n^{2/3})$,
but it is the sharpest known result. The ease with which it comes from (2.1) is
surprising if one initially studies $L_n$ from a combinatorial perspective like that of
Chvátal and Sankoff (1974).
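For very short strings the bound (3.2) can be compared with the exact variance by brute force. The sketch below computes $L_n$ with the standard dynamic program and enumerates all pairs of strings; the helper names and the fair-coin alphabet are assumptions for illustration, and the enumeration is only practical for small $n$.

```python
from fractions import Fraction
from itertools import product


def lcs_length(x, y):
    """Length of the longest common subsequence, by the standard dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_variance(n, probs):
    """Exact var L_n for two independent i.i.d. strings of length n with letter law probs."""
    letters = list(probs)
    mean = Fraction(0)
    second = Fraction(0)
    for x in product(letters, repeat=n):
        for y in product(letters, repeat=n):
            p = Fraction(1)
            for a in x + y:
                p *= probs[a]
            length = lcs_length(x, y)
            mean += p * length
            second += p * length * length
    return second - mean ** 2


fair = {"H": Fraction(1, 2), "T": Fraction(1, 2)}  # coin-flip alphabet
n = 3
variance = lcs_variance(n, fair)
bound = n * (1 - sum(p * p for p in fair.values()))  # right-hand side of (3.2)
print(variance, bound)
```

For $n = 1$ the statistic is just the indicator that the two letters agree, so its variance is $1/4$, comfortably below the bound $1/2$.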

It is tempting to try to improve (3.2) by use of the change-m inequality (2.7). 
To do so would require improving the naive bound 


$$E(S - S_{[m]})^2 \le m^2,$$


since this bound leads only to a variance bound given by the ratio m(n — 1Xn — 
2)---(n— m + 1)/(m — 1). This bound is not even linear in n for fixed m > 2, 
and trying to optimally choose m for fixed n does not help since the bound is 
minimized for $m = 1$. One seems to need new combinatorial insights to improve
the naive bound on $(S - S_{[m]})^2$, and thus to use (2.7) with effect.

It is likely that (2.7) is in fact sharper than (2.1). Karlin and Rinott (1982) 
found that to be the case with the Bhargava (1980) version of the original 
Efron—Stein inequality and there is no reason to expect our version to break with 
tradition. 

The hard part of using (2.7) is not a lack of sharpness but rather an excess of 
complexity. One has to find a way to get strong information on how L,, changes 
as one changes a substantial part of the sample. This is harder than getting a 
decent bound on the possible variation due to changing a single observation. 

The longest common subsequence problem has a natural analogue for $k$ strings
for which much of the preceding theory goes through with little change. One
benefit of the $k$-string analogue is that it gives a second handle on the constant $c$.

To define the simplest instance of the $k$-sequence problem we consider $k = 3$;
let $Y_i = (X_i, X_i', X_i'')$ and set

$$L_n = \max\{t : X_{i_1} = X_{j_1}' = X_{k_1}'', \ldots, X_{i_t} = X_{j_t}' = X_{k_t}''\},$$

where $1 \le i_1 < \cdots < i_t \le n$, $1 \le j_1 < \cdots < j_t \le n$, $1 \le k_1 < \cdots < k_t \le n$.
The same proof that Chvátal and Sankoff (1974) give for $k = 2$ will show that

$$\lim_{n \to \infty} EL_n/n = c_3,$$

and the same proof given by Deken (1979) shows that $L_n/n \to c_3$ almost surely.
It would be of interest to relate $c_3$ to $c_2$, and one is tempted to speculate that
$c_3 = c_2^2$ (and more generally that $c_k = c_2^{k-1}$). Computational evidence does
not yet rule this out. The application of (2.7) to this new functional gives 
var L, < 4n(1 — Lp?) in the case k = 3. Again, this seems difficult to improve. 


Acknowledgments. I am indebted to Michael Waterman and Louis Gordon for stimulating this work, and to Richard Arratia for his comments on $c = 2/(1 + \sqrt{2})$.


REFERENCES 


BHARGAVA, R. P. (1980). A property of the jackknife estimation of the variance when more than one 
observation is omitted. Technical Report No. 140, Dept. of Statistics, Stanford University. 


CHVÁTAL, V. and SANKOFF, D. (1975). Longest common subsequences of two random sequences. J.
Appl. Probab. 12 306-315.

DEKEN, J. P. (1979). Some limit results for longest common subsequences. Discrete Math. 26 17-31.

EFRON, B. and STEIN, C. (1981). The jackknife estimate of variance. Ann. Statist. 9 586-596.

KARLIN, S. and RINOTT, Y. (1982). Application of ANOVA type decomposition for comparisons of
conditional variance statistics including jackknife estimates. Ann. Statist. 10 485-501.

SANKOFF, D. and KRUSKAL, J. B., eds. (1983). Time Warps, String Edits, and Macromolecules: The
theory and practice of sequence comparison. Addison-Wesley, Reading, Mass.

STEELE, J. M. (1981). Complete convergence of short paths and Karp's algorithm for the TSP. Math.
Oper. Res. 6 374-378.

STEELE, J. M. (1982a). Long common subsequences and the proximity of two random strings. SIAM
J. Appl. Math. 42 731-737.

STEELE, J. M. (1982b). Optimal triangulations of random samples in the plane. Ann. Probab. 10
548-553.

VITALE, R. A. (1984). An expansion for symmetric statistics and the Efron-Stein inequality. In
Inequalities in Statistics and Probability, IMS Lecture Notes-Monograph Series 5 (Y. L.
Tong, ed.) 112-114. IMS, Hayward, Calif.


DEPARTMENT OF STATISTICS 
PRINCETON UNIVERSITY 
PRINCETON, NEW JERSEY 08544 




The Annals of Statistics 
1986, Vol 14, No 2, 759-765 


CHI-SQUARE GOODNESS-OF-FIT TESTS FOR RANDOMLY
CENSORED DATA¹


By M. G. HABIB² AND D. R. THOMAS


Oregon State University 


Two Pearson-type goodness-of-fit test statistics for parametric families
are considered for randomly right-censored data. Asymptotic distribution
theory for the test statistics is based on the result that the product-limit
process with MLE for nuisance parameters converges weakly to a Gaussian
process. The Chernoff-Lehmann (1954) result extends to a generalized Pearson
statistic. A modified Pearson statistic is shown to have a limiting
chi-square null distribution.


1. Introduction. In this paper we consider the problem of testing the goodness of fit of a parametric family $\{F(t;\theta);\ \theta \in \Theta\}$ of survival distributions from arbitrarily right-censored data. Pearson-type chi-squared statistics which compare the Kaplan-Meier (1958) estimate $\hat F_N(t)$ to the parametric MLE $F(t;\hat\theta_N)$ are studied. The random functions $N^{1/2}[\hat F_N(t) - F(t;\hat\theta_N)]$ are shown to have a limiting Gaussian process, which generalizes the result of Breslow and Crowley (1974) for $N^{1/2}[\hat F_N(t) - F(t;\theta)]$ where $\theta$ is the true value. From this result limiting distributions of the Pearson-type statistics are obtained. The limiting process result may be of more general use than for the Pearson-type statistics considered here.

We use the random censorship model. There are $N$ pairs of independent nonnegative random variables $(X_1, U_1), (X_2, U_2), \ldots, (X_N, U_N)$, where the $X$'s denote failure times and the $U$'s the random censoring times. The observed data consist only of $Y_i = \min(X_i, U_i)$ and the indicator functions $\delta_i = I_{\{X_i \le U_i\}}$ for $i = 1, \ldots, N$. Let $H(u) = P(U > u)$ denote the unknown absolutely continuous survival function for the censoring variable and assume that the distribution of $X$ belongs to a family of absolutely continuous survival functions $\{F(x;\theta):\ \theta \in \Theta\}$ where $\Theta$ is an open set in $k$-dimensional Euclidean space $R^k$. We consider the MLE $\hat\theta_N$ of the parameter $\theta$ based on a random sample from the joint distribution of $Y$ and $\delta$ with density function

(1) $g(t, \delta; \theta) = [f(t;\theta)H(t)]^{\delta}\,[F(t;\theta)h(t)]^{1-\delta}$

with respect to the product of Lebesgue measure on $(0, \infty)$ and counting measure on $\{0, 1\}$, where $f(t;\theta)$ and $h(t)$ are the density functions corresponding to $F(t;\theta)$ and $H(t)$, respectively.
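
To make the role of density (1) concrete, here is a minimal sketch for one particular family, the exponential survival function $F(t;\theta) = e^{-\theta t}$, which is an illustrative choice and not a family singled out by the paper. For this family the factors of (1) involving $\theta$ give the log-likelihood $\sum_i (\delta_i \ln\theta - \theta Y_i)$, so the MLE is the number of observed failures divided by the total time on test. The function names and simulation settings below are hypothetical.

```python
import math
import random


def exponential_censored_mle(times, events):
    """MLE of theta for F(t; theta) = exp(-theta * t) from pairs (Y_i, delta_i)."""
    failures = sum(events)
    exposure = sum(times)
    if failures == 0:
        raise ValueError("no uncensored observations: the MLE does not exist")
    return failures / exposure


def log_likelihood(theta, times, events):
    """Log of the product of density (1), dropping factors free of theta."""
    return sum(d * math.log(theta) - theta * t for t, d in zip(times, events))


# Simulate the random censorship model: Y = min(X, U), delta = 1 if X <= U.
rng = random.Random(1)
theta_true, censor_rate, n = 2.0, 1.0, 2000
x = [rng.expovariate(theta_true) for _ in range(n)]
u = [rng.expovariate(censor_rate) for _ in range(n)]
y = [min(a, b) for a, b in zip(x, u)]
delta = [1 if a <= b else 0 for a, b in zip(x, u)]

theta_hat = exponential_censored_mle(y, delta)
print(f"theta_hat = {theta_hat:.3f} (true value {theta_true})")
```

Note that the censoring functions $H$ and $h$ drop out of the maximization, which is why they can remain unspecified.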


Received September 1981; revised September 1985. 

¹This research was supported in part by USPHS grant CA27632 from the National Cancer
Institute, DHHS.

²Currently at Kuwait University.

AMS 1980 subject classifications. Primary 62G10; secondary 62E20. 

Key words and phrases. Goodness-of-fit, chi-square tests, censored data, product-limit process. 




We consider generalizations of Pearson-type statistics to randomly censored data. Let $0 = t_0 < t_1 < \cdots < t_r < \infty$ denote boundaries for $r + 1$ cells. The cell boundaries could be random, e.g., for specified survival probabilities $P_i$ we may select $\hat t_{Ni}$ satisfying $F(\hat t_{Ni};\hat\theta_N) = P_i$ as our boundaries. The test statistics are quadratic forms in the random vector

(2) $\hat Z_N = N^{1/2}(\hat F_N - \tilde F_N)$,

where $\hat F_N = (\hat F_N(t_1), \ldots, \hat F_N(t_r))'$ and $\tilde F_N = (F(t_1;\hat\theta_N), \ldots, F(t_r;\hat\theta_N))'$ are respectively the product-limit estimator and the MLE for the survival function.

In Section 2 the product-limit process with estimated parameters $\hat Z_N(t) = N^{1/2}[\hat F_N(t) - F(t;\hat\theta_N)]$ is shown to converge weakly to a Gaussian process under the null hypothesis $H_0$: $F(t) \in \{F(t;\theta);\ \theta \in \Theta\}$. This generalizes the result of Breslow and Crowley (1974) for a completely specified survival function $F(t)$. In Section 3 a modified Pearson statistic $\hat Q_N(\hat\theta_N)$ and a generalized Pearson statistic $Q_N(\hat\theta_N)$ are considered [see (7) and (8)]. The modified Pearson statistic $\hat Q_N(\hat\theta_N)$ is shown in Theorem 2 to have a limiting $\chi^2_r$ distribution. For uncensored data the statistic $\hat Q_N(\hat\theta_N)$ reduces to that proposed by Rao and Robson (1974) and Nikulin (1973), with further development by Moore and Spruill (1975) and Moore (1977). The limiting distribution of the generalized Pearson statistic $Q_N(\hat\theta_N)$ is shown in Theorem 3 to be bounded by $\chi^2_{r-k}$ and $\chi^2_r$ distributions, which is a generalization of the Chernoff and Lehmann (1954) result. These asymptotic results hold for random cell boundaries as well as for fixed cell boundaries.

Chen (1975) and Turnbull and Weiss (1978) proposed goodness-of-fit tests for composite null hypotheses with randomly censored data. Chen considered a generalized Pearson statistic $Q(\tilde\theta)$, of the same form as (8), based on a modified minimum $\chi^2$ estimator $\tilde\theta$. The statistic $Q(\tilde\theta)$ was shown to have a limiting $\chi^2_{r-k}$ distribution under composite null hypotheses. Turnbull and Weiss considered a likelihood ratio test based on the more restrictive model where both the failure distribution and the censoring distribution are assumed to be discrete with finite support. Several tests have been suggested for the case of a simple null hypothesis with randomly censored data; see Koziol and Green (1976), Hollander and Proschan (1979), Fleming et al. (1980), Fleming and Harrington (1981), and Nair (1981). For the case of Type II censoring (when censoring occurs at specified ordered failures), Mihalko and Moore (1980) used sample percentiles as cell boundaries to obtain Pearson-type tests of fit that have limiting chi-square distributions for composite null hypotheses.


2. Weak convergence of the process $\hat Z_N(t)$. Let the random function $Z_N(t) = N^{1/2}[\hat F_N(t) - F(t;\theta)]$ be defined on an interval $[0, T]$ where $H(T)F(T;\theta) > 0$. Breslow and Crowley (1974) proved that $Z_N(t)$ converges weakly to a mean 0 Gaussian process $Z(t)$ with

(3) $\operatorname{Cov}(Z(t), Z(s)) = F(t;\theta)F(s;\theta)\int_0^s \dfrac{f(y;\theta)}{H(y)F^2(y;\theta)}\,dy$

for $0 \le s \le t \le T$.




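To show weak convergence of $\hat Z_N(t)$ we make the following assumptions:


(A.1) $F(t;\theta)$ and $f(t;\theta)$ are twice differentiable in $\theta$ with continuous derivatives.
(A.2) The information matrix $J = J[\theta, H]$, with elements

$J_{ij} = -\int_0^\infty \dfrac{\partial^2 \ln f(t;\theta)}{\partial\theta_i\,\partial\theta_j}\, f(t;\theta)H(t)\,dt - \int_0^\infty \dfrac{\partial^2 \ln F(t;\theta)}{\partial\theta_i\,\partial\theta_j}\, F(t;\theta)h(t)\,dt$

for $i, j = 1, \ldots, k$, is positive definite, and is continuous in $\theta$.
(A.3) The MLE $\hat\theta_N$ exists and is efficient with $N^{1/2}(\hat\theta_N - \theta) = J^{-1}W_N + o_p(1)$, where $W_N$ is the normalized score vector

$W_N = N^{-1/2}\sum_{i=1}^{N} \dfrac{\partial \ln g(Y_i, \delta_i; \theta)}{\partial\theta}.$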

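THEOREM 1. Let $T < \infty$ satisfy $H(T)F(T;\theta) > 0$ for $\theta \in \Theta$. Then, under the Assumptions A, the random function $\hat Z_N(t)$, for $0 \le t \le T$, converges weakly to a mean 0 Gaussian process $\hat Z(t)$ with

$\operatorname{Cov}[\hat Z(s), \hat Z(t)] = \operatorname{Cov}[Z(s), Z(t)] - \dfrac{\partial F(s;\theta)}{\partial\theta'}\, J^{-1}\, \dfrac{\partial F(t;\theta)}{\partial\theta}$

for $0 \le s \le t \le T$.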


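PROOF. Expand $\hat Z_N(t)$ around $\hat\theta_N = \theta$ to give

(4) $\hat Z_N(t) = Z_N(t) + Z_N^*(t) + R_N(t)$,

where

$Z_N^*(t) = -\dfrac{\partial F(t;\theta)}{\partial\theta'}\, N^{1/2}(\hat\theta_N - \theta)$

and $R_N(t) \to 0$ in probability uniformly in $t$.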

First we show convergence of finite-dimensional distributions of $Z_N(t) + Z_N^*(t)$. For an arbitrary partition $0 < t_1 < \cdots < t_r \le T$, let $Z_N = (Z_N(t_1), \ldots, Z_N(t_r))'$ and $Z_N^* = (Z_N^*(t_1), \ldots, Z_N^*(t_r))'$, so that $Z_N^* = BN^{1/2}(\hat\theta_N - \theta)$, where the elements of $B$ are $B_{ij} = -\partial F(t_i;\theta)/\partial\theta_j$. The components of $Z_N$ and of $N^{1/2}(\hat\theta_N - \theta)$ can each be written, to order $o_p(1)$, as a normalized sum of continuous functions of $(Y_1, \delta_1), \ldots, (Y_N, \delta_N)$. This follows from Breslow and Crowley's (1974) results (7.9), (7.12), and Assumption A.3, respectively. Hence, from the Central Limit Theorem, we have

(5) $\begin{pmatrix} Z_N \\ N^{1/2}(\hat\theta_N - \theta) \end{pmatrix} \to_d \begin{pmatrix} Z \\ \eta \end{pmatrix} \sim N\left(0, \begin{pmatrix} V & C \\ C' & J^{-1} \end{pmatrix}\right),$

where the elements of $V$ are given by (3) with $V_{ij} = \operatorname{Cov}[Z(t_i), Z(t_j)]$ and $J$ is the information matrix given in A.2. Hence $Z_N + Z_N^* \to_d \hat Z = Z + B\eta \sim N(0, \Sigma)$.




Further, $\Sigma$ can be evaluated without direct computation of $C$ in (5) as follows. Under A.1-A.3 it follows from the result of Pierce (1982) that $\hat Z$ and $\eta$ are independent, and thus that $V = \Sigma + BJ^{-1}B'$. This gives the main result here that

(6) $\operatorname{Var}(\hat Z) = \Sigma = V - BJ^{-1}B'$.

Having shown this, the weak convergence of $Z_N(t) + Z_N^*(t)$ will then follow from marginal weak convergence of $Z_N(t)$ and $Z_N^*(t)$ to continuous limits; see, for example, the argument used by Breslow and Crowley (1974, Theorem 4). The convergence of $Z_N(t)$ is a standard result and that of $Z_N^*(t)$ is clear, since it is a nonrandom vector function of $t$ multiplied by a random vector (free of $t$) with a limiting distribution. □


3. The test statistics. Let $\hat V$ and $\hat\Sigma$ denote respectively the estimators obtained from the covariance matrices $V$ and $\Sigma$ by replacing $\theta$ by the MLE $\hat\theta_N$ and the censoring distribution $H$ by the product-limit estimator $\hat H_N$.

The modified Pearson statistic is defined as

(7) $\hat Q_N(\hat\theta_N) = \hat Z_N'\,\hat\Sigma^{-1}\hat Z_N$

and the generalized Pearson statistic as

(8) $Q_N(\hat\theta_N) = \hat Z_N'\,\hat V^{-1}\hat Z_N$.

The limiting distributions for these test statistics are developed in the following two theorems. The arguments are given for fixed-cell boundaries first, with subsequent extension to random-cell boundaries.


THEOREM 2. Under composite null hypotheses, Assumptions A.1-A.3 and the assumption that $\Sigma$ is of full rank $r$, the statistic $\hat Q_N(\hat\theta_N)$ has a limiting $\chi^2_r$ distribution.


PROOF. The components $V$, $B$, and $J$ are each continuous in $\theta$. It can be shown that $V$ is continuous in $H$ with respect to the supremum metric over the interval $[0, T]$ and $J$ is continuous in $H$ with respect to the supremum metric over the interval $[0, \infty)$. Since $\hat\theta_N$ and $\hat H_N$ are consistent estimators it then follows that $\hat V$ and $\hat\Sigma$ converge in probability to $V$ and $\Sigma$, respectively. Theorem 1 can then be used to complete the proof. □


THEOREM 3. Under composite null hypotheses, Assumptions A.1-A.3 and the assumption that the gradient matrix $B$ is of full rank ($k$), the statistic $Q_N(\hat\theta_N)$ has a limiting distribution which is bounded by $\chi^2_{r-k}$ and $\chi^2_r$ distributions.


PROOF. From Theorem 1 and the convergence of $\hat V$ to $V$ in probability it follows that $Q_N(\hat\theta_N) \to_d \hat Z'V^{-1}\hat Z$, where $\hat Z \sim N(0, \Sigma)$. Let $\Lambda$ be a diagonal matrix of eigenvalues of $V$ and $P$ the corresponding orthogonal matrix of eigenvectors. Then let $\Lambda^*$ be a diagonal matrix of eigenvalues of $\Lambda^{-1/2}P'\Sigma P\Lambda^{-1/2}$ and $P^*$ the corresponding orthogonal matrix of eigenvectors. We can then write

(9) $\hat Z'V^{-1}\hat Z = \sum_{i=1}^{r} \lambda_i^*\,\xi_i^2,$

where the $\xi_i$'s are independent $N(0, 1)$. The eigenvalues $\lambda_i^*$ satisfy the equation

$0 = |P\Lambda^{1/2}(\Lambda^{-1/2}P'\Sigma P\Lambda^{-1/2} - \lambda_i^* I)\Lambda^{1/2}P'|$
$\;\; = |\Sigma - \lambda_i^* V|$
$\;\; = (-1)^r\,|BJ^{-1}B' - (1 - \lambda_i^*)V|.$

To conclude the proof, note that the nonzero roots of $|BJ^{-1}B' - \lambda V| = 0$ and those of $|B'V^{-1}B - \lambda J| = 0$ are identical [see Rao (1973, page 68)] and that, since $BJ^{-1}B'$ and $B'V^{-1}B$ are nonnegative definite, (9) reduces to

$\hat Z'V^{-1}\hat Z = \sum_{i=1}^{k} \lambda_i\,\xi_i^2 + \sum_{i=k+1}^{r} \xi_i^2,$

where $\lambda_i \in (0, 1)$ for $i = 1, \ldots, k$. □


The treatment of random change of time on pages 144-145 of Billingsley (1968) can be applied here. He shows that if $\Phi_N$ is a random monotone function which converges in probability to a function $\Phi$ with $\Phi_N$ and $\Phi$ having the same finite domain then the random composite function $\hat Z_N \circ \Phi_N$ converges weakly to the Gaussian process $\hat Z \circ \Phi$. The asymptotic distributions of the test statistics $\hat Q_N$ and $Q_N$ given in Theorems 2 and 3 then hold for random partition points which depend on $\Phi_N$. For example, in our chi-square goodness-of-fit application, one can truncate the fitted survival function at $t = T$ and define $F^*(t;\hat\theta_N) = F(t;\hat\theta_N)$ for $-\infty < t < T$ and 0 otherwise. Then use $\Phi_N(P) = F^{*-1}(P;\hat\theta_N)$ for $0 < P < 1$ to produce random cell boundaries based on specified values on the survival scale $1 > P_1 > \cdots > P_{r^*} > 0$. For a sample of size $N$ one might need to reduce the number of cells from $r^*$ to $\hat r_N$ where $\hat r_N = \max\{i: F^{*-1}(P_i;\hat\theta_N) < T\}$. Then $\hat r_N$ converges in probability to $r$ where $r = \max\{i: P_i > F(T;\theta)\}$. The asymptotic distribution of the test statistics $\hat Q_N$ and $Q_N$ then holds for the random partition points $\hat t_{Ni} = F^{*-1}(P_i;\hat\theta_N)$.
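
The construction of random cell boundaries on the survival scale can be sketched as follows for the exponential family $F(t;\theta) = e^{-\theta t}$, chosen here only because its inverse $F^{-1}(P;\theta) = -\ln P/\theta$ is explicit; the paper does not restrict attention to this family. The function name and the numerical values are hypothetical. Boundaries at or beyond the truncation point $T$ are dropped, so the returned list has $\hat r_N$ entries.

```python
import math


def random_cell_boundaries(theta_hat, probs, T):
    """Boundaries t_i = F^{-1}(P_i; theta_hat) for F(t; theta) = exp(-theta * t).

    Only boundaries strictly below the truncation point T are kept, so the
    length of the result plays the role of r_hat_N in the text.
    """
    decreasing = all(a > b for a, b in zip(probs, probs[1:]))
    if not decreasing or not all(0 < p < 1 for p in probs):
        raise ValueError("need 1 > P_1 > ... > P_r* > 0")
    bounds = []
    for p in probs:
        t = -math.log(p) / theta_hat
        if t >= T:
            break  # later boundaries are larger still, since probs decrease
        bounds.append(t)
    return bounds


bounds = random_cell_boundaries(theta_hat=1.0, probs=[0.8, 0.6, 0.4, 0.2], T=1.0)
print(len(bounds), [round(b, 3) for b in bounds])
```

With these inputs the last requested boundary, $-\ln 0.2 \approx 1.609$, exceeds $T = 1$ and is discarded, so three boundaries remain.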


4. Computation. Let there be $c$ censored observations with censoring times $t_1^* < t_2^* < \cdots < t_c^*$. Define $t_0^* = 0$ and $t_{c+1}^* = \infty$. Let $m(s) = \max\{j: t_j^* < s\}$. Then, the uncorrected covariance matrix $\hat V$ and the generalized Pearson statistic $Q_N(\hat\theta_N)$ reduce to the following simple forms:


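$\hat V_{ij} = F(t_i;\hat\theta_N)\,F(t_j;\hat\theta_N)\,d_i, \qquad i \le j,$

and

$Q_N(\hat\theta_N) = N\sum_{i=1}^{r} \{\hat F_N(t_{i-1})/F(t_{i-1};\hat\theta_N) - \hat F_N(t_i)/F(t_i;\hat\theta_N)\}^2/(d_i - d_{i-1}),$

where $d_0 = 0$ and

$d_i = \int_0^{t_i} \dfrac{f(y;\hat\theta_N)}{\hat H_N(y)\,F^2(y;\hat\theta_N)}\,dy$

$\quad = \sum_{j=1}^{m(t_i)} \{1/F(t_j^*;\hat\theta_N) - 1/F(t_{j-1}^*;\hat\theta_N)\}/\hat H_N(t_{j-1}^*) + \{1/F(t_i;\hat\theta_N) - 1/F(t_{m(t_i)}^*;\hat\theta_N)\}/\hat H_N(t_{m(t_i)}^*).$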


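Unfortunately, the computation of the modified Pearson statistic $\hat Q_N(\hat\theta_N)$, which uses the corrected covariance matrix $\hat\Sigma$, is not as simple as that of $Q_N(\hat\theta_N)$. However, one could take advantage of the identity $\hat\Sigma^{-1} = \hat V^{-1} + \hat C$, where $\hat C = \hat V^{-1}\hat B(\hat J - \hat B'\hat V^{-1}\hat B)^{-1}\hat B'\hat V^{-1}$, to get $\hat Q_N(\hat\theta_N) = Q_N(\hat\theta_N) + \hat Z_N'\hat C\hat Z_N$. Thus, if $Q_N(\hat\theta_N)$ is greater than $\chi^2_r(\alpha)$, then $\hat Q_N(\hat\theta_N)$ would be also. Further, because $\hat Z_N'\hat C\hat Z_N$ is bounded by $\chi^2_k$, it follows that, if $Q_N(\hat\theta_N)$ is less than $\chi^2_{r-k}(\alpha)$ then $\hat Q_N(\hat\theta_N)$ would be less than $\chi^2_r(\alpha)$. Hence, $\hat Q_N(\hat\theta_N)$ need only be computed when $\chi^2_{r-k}(\alpha) < Q_N(\hat\theta_N) < \chi^2_r(\alpha)$. In such a case, the components of the information matrix $\hat J$ are required. They are given by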


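$\hat J_{ij} = -\sum_{l=1}^{c+1} \hat H_N(t_{l-1}^*)\{K_{ij}(t_l^*) - K_{ij}(t_{l-1}^*)\} - \sum_{l=1}^{c+1} \dfrac{\partial^2 \ln F(t_l^*;\hat\theta_N)}{\partial\theta_i\,\partial\theta_j}\, F(t_l^*;\hat\theta_N)\{\hat H_N(t_{l-1}^*) - \hat H_N(t_l^*)\},$

where

$K_{ij}(s) = \int_0^s \dfrac{\partial^2 \ln f(t;\hat\theta_N)}{\partial\theta_i\,\partial\theta_j}\, f(t;\hat\theta_N)\,dt.$

These components, $K_{ij}(s)$, require numerical integration in some applications, for example the two-parameter Weibull and gamma distributions.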
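
The inverse identity used in this section is an instance of the Woodbury formula: for $\Sigma = V - BJ^{-1}B'$ one has $\Sigma^{-1} = V^{-1} + V^{-1}B(J - B'V^{-1}B)^{-1}B'V^{-1}$. The following small numerical check is a sketch only, with $r = 2$ cells and $k = 1$ parameter and made-up values for $V$, $B$, $J$, and $Z$; it also confirms that the correction term is nonnegative, so the modified statistic is never smaller than the generalized one.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quad_form(m, z):
    """z' m z for a 2 x 2 matrix m and a length-2 vector z."""
    return sum(z[i] * m[i][j] * z[j] for i in range(2) for j in range(2))


# Hypothetical inputs: V positive definite, B an r x k gradient with k = 1.
V = [[2.0, 0.5], [0.5, 1.0]]
B = [1.0, 0.5]
J = 2.0
Z = [0.3, -0.4]

# Corrected covariance Sigma = V - B J^{-1} B'.
Sigma = [[V[i][j] - B[i] * B[j] / J for j in range(2)] for i in range(2)]

V_inv = inv2(V)
V_inv_B = [sum(V_inv[i][j] * B[j] for j in range(2)) for i in range(2)]
schur = J - sum(B[i] * V_inv_B[i] for i in range(2))  # J - B' V^{-1} B
C = [[V_inv_B[i] * V_inv_B[j] / schur for j in range(2)] for i in range(2)]

Sigma_inv_direct = inv2(Sigma)
Sigma_inv_identity = [[V_inv[i][j] + C[i][j] for j in range(2)] for i in range(2)]

Q = quad_form(V_inv, Z)                 # generalized Pearson form
Q_hat = Q + quad_form(C, Z)             # modified Pearson form via the identity
print(Sigma_inv_direct, Sigma_inv_identity, Q, Q_hat)
```

Working through the identity avoids inverting $\hat\Sigma$ directly: only $\hat V^{-1}$, which has the simple form above, and a $k \times k$ inverse are needed.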


Acknowledgment. We are grateful to a referee for suggesting the current 
version of Theorem 1 which permitted inclusion of random cell boundaries. 


REFERENCES 


BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York.

BRESLOW, N. and CROWLEY, J. (1974). A large sample study of the life table and product limit
estimates under random censorship. Ann. Statist. 2 437-453.

CHEN, J. (1975). Goodness of fit tests under random censorship. Ph.D. thesis, Dept. Statistics, Oregon
State Univ.

CHERNOFF, H. and LEHMANN, E. L. (1954). The use of maximum likelihood estimates in $\chi^2$ tests for
goodness of fit. Ann. Math. Statist. 25 579-586.

FLEMING, T. and HARRINGTON, D. (1981). A class of hypothesis tests for one and two sample
censored survival data. Comm. Statist. A-Theory Methods 10 763-794.

FLEMING, T., O'FALLON, J. and O'BRIEN, P. (1980). Modified Kolmogorov-Smirnov test procedures
with application to arbitrarily right-censored data. Biometrics 36 607-625.

HABIB, M. G. (1981). A chi-square goodness-of-fit test for censored data. Ph.D. thesis, Dept.
Statistics, Oregon State Univ.

HOLLANDER, M. and PROSCHAN, F. (1979). Testing to determine the underlying distribution using
randomly censored data. Biometrics 35 393-401.

KAPLAN, E. L. and MEIER, P. (1958). Nonparametric estimation from incomplete observations. J.
Amer. Statist. Assoc. 53 457-481.

KOZIOL, J. A. and GREEN, S. B. (1976). A Cramér-von Mises statistic for randomly censored data.
Biometrika 63 465-474.

MIHALKO, D. P. and MOORE, D. S. (1980). Chi-square tests of fit for Type II censored data. Ann.
Statist. 8 625-644.

MOORE, D. S. (1977). Generalized inverses, Wald's method, and the construction of chi-square tests
of fit. J. Amer. Statist. Assoc. 72 131-137.

MOORE, D. S. and SPRUILL, M. C. (1975). Unified large sample theory of general chi-square statistics
for tests of fit. Ann. Statist. 3 599-616.

NAIR, V. N. (1981). Plots and tests for goodness of fit with randomly censored data. Biometrika 68
99-103.

NIKULIN, M. (1973). Chi-square tests for continuous distributions with shift and scale parameters.
Theory Probab. Appl. 18 559-568.

PIERCE, D. A. (1982). The asymptotic effect of substituting estimators for parameters in certain
types of statistics. Ann. Statist. 10 475-478.

PYKE, R. (1969). Applications of almost surely convergent constructions of weakly convergent
processes. Lecture Notes in Math. 88 187-200. Springer, New York.

RAO, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York.

RAO, K. C. and ROBSON, D. S. (1974). A chi-square statistic for goodness-of-fit tests within the
exponential family. Comm. Statist. 3 1139-1153.

TURNBULL, B. W. and WEISS, L. (1978). A likelihood ratio statistic for testing goodness-of-fit with
randomly censored data. Biometrics 34 367-375.


DEPARTMENT OF STATISTICS 
OREGON STATE UNIVERSITY 
CORVALLIS, OREGON 97331 


The Annals of Statistics
1986, Vol 14, No 2, 766-773 


SOME ASYMPTOTIC PROPERTIES OF KERNEL ESTIMATORS 
OF A DENSITY FUNCTION IN CASE OF CENSORED DATA 


By JAN MIELNICZUK 
Polish Academy of Sciences 


The kernel estimator is a widely used tool for the estimation of a density 
function. In this paper its adaptation to censored data using the Kaplan-Meier 
estimator is considered. Asymptotic properties of four estimators, arising 
naturally as a result of considering various types of bandwidths, are investigated.
In particular we show that (i) both proposed estimators stemming from
the nearest neighbor estimator have censoring-free variances and (ii) one of
them is pointwise mean consistent.


1. Introduction. Consider the random censorship model with two sequences $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ of i.i.d. nonnegative random variables such that $X_i$, $Y_i$ are independent ($i = 1, \ldots, n$). Let $F$ and $G$ be the unknown right continuous distribution functions of the $X$'s and the $Y$'s, respectively. It is assumed that $X_i$ and $Y_i$ have densities $f$ and $g$ with respect to Lebesgue measure on $R^1$. We want to estimate $f$ using the following data:

$Z_i = \min(X_i, Y_i), \qquad \delta_i = [X_i \le Y_i], \qquad i = 1, \ldots, n,$

where $[A]$ for any event $A$ denotes the indicator function of $A$. Let $H$ be the distribution function of the $Z$'s and let $Z_{(1)}, \ldots, Z_{(n)}$ denote the ordered sample. The well-known KM estimator (Kaplan and Meier (1958)) is defined by

$$1 - F_n(u) = \begin{cases} \prod_{i:\, Z_{(i)} \le u} \left(\dfrac{n-i}{n-i+1}\right)^{\delta_{(i)}}, & u \le Z_{(n)}, \\ 0, & u > Z_{(n)}, \end{cases}$$

$\delta_{(i)}$ being the concomitant of $Z_{(i)}$. The KM empirical survival function $1 - F_n$ will be denoted by $\bar F_n$.
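
The product formula for the KM estimator translates directly into a running product over the ordered sample. The sketch below is illustrative only: the function names are hypothetical, ties among the $Z_i$ are assumed absent (as they are almost surely for continuous distributions), and the step function is evaluated with the convention that it is 0 beyond the largest observation.

```python
def kaplan_meier_survival(z, delta):
    """Sorted times Z_(i) and the KM survival value just after each of them.

    Assumes no ties among the Z_i.
    """
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    times, surv, s = [], [], 1.0
    for rank, idx in enumerate(order, start=1):
        if delta[idx] == 1:
            # Factor ((n - i) / (n - i + 1)) ** delta_(i) for an uncensored point.
            s *= (n - rank) / (n - rank + 1)
        times.append(z[idx])
        surv.append(s)
    return times, surv


def survival_at(u, times, surv):
    """Evaluate the step function 1 - F_n(u); zero beyond the largest Z."""
    if u > times[-1]:
        return 0.0
    value = 1.0
    for t, s in zip(times, surv):
        if t <= u:
            value = s
        else:
            break
    return value


times, surv = kaplan_meier_survival([3.0, 1.0, 2.0], [1, 1, 0])
print(times, surv, survival_at(1.5, times, surv))
```

A censored observation leaves the product unchanged but still reduces the number at risk for later factors, which is how censoring enters the estimate.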

Received June 1984; revised July 1985.

AMS 1980 subject classifications. Primary 62G05; secondary 60F15.

Key words and phrases. Censored data, density estimator, k nearest neighbor estimator,
Kaplan-Meier estimator, kernel, random censorship model.


Blum and Susarla (1980) introduced a kernel-type estimator of $f$, considered then by Földes, Rejtö, and Winter (1981b). The estimator is based on the KM estimator:

(1.1) $f_n(x) = \dfrac{1}{h(n)} \int K\left(\dfrac{x - y}{h(n)}\right) dF_n(y),$



where $h(n)$ is a sequence of positive numbers such that $h(n) \to 0$, $nh(n) \to \infty$, and $K$ is a density function. Analogously, we define a $k(n)$th nearest uncensored neighbor estimator

(1.2) $\hat f_n(x) = \dfrac{1}{R(n)} \int K\left(\dfrac{x - y}{R(n)}\right) dF_n(y),$

where $R(n)$ is the distance from $x$ to its $k(n)$th nearest uncensored neighbor and $k(n)$ is a given sequence of integers such that $k(n) \to \infty$ and $k(n)/n \to 0$. Tanner (1983) used this type of bandwidth in hazard rate estimation from censored data. Classical nearest neighbor estimators were studied by Moore and Yackel (1976, 1977), Mack and Rosenblatt (1979), and Mack (1980). Moreover, we introduce


(1.3) $f_n^*(x) = \dfrac{1}{h(n_1)} \int K\left(\dfrac{x - y}{h(n_1)}\right) dF_n(y),$

(1.4) $\hat f_n^*(x) = \dfrac{1}{R(n_1)} \int K\left(\dfrac{x - y}{R(n_1)}\right) dF_n(y),$

where $n_1 = \sum_{i=1}^{n} \delta_i$ is the number of uncensored ($\delta_i = 1$) observations. Two other estimators were studied by Blum and Susarla (1980) and McNichols and Padgett (1984a). A thorough survey of recent results in density estimation for censored data is given in McNichols and Padgett (1984b).
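
Because the KM estimator is a step function that jumps only at uncensored observations, an integral of a kernel against $dF_n$ is a finite weighted sum over those jumps. The sketch below illustrates this for a fixed bandwidth and for a nearest uncensored neighbor bandwidth. It is not the author's implementation: the function names are hypothetical, ties are assumed absent, and the Epanechnikov kernel is an illustrative choice of a bounded density supported on $[-1, 1]$.

```python
def km_jumps(z, delta):
    """(time, jump) pairs of the KM estimator F_n at uncensored points (no ties)."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    jumps, s = [], 1.0
    for rank, idx in enumerate(order, start=1):
        if delta[idx] == 1:
            new_s = s * (n - rank) / (n - rank + 1)
            jumps.append((z[idx], s - new_s))
            s = new_s
    return jumps


def epanechnikov(u):
    """A bounded density supported on [-1, 1] (illustrative kernel choice)."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kernel_estimate(x, z, delta, bandwidth, kernel=epanechnikov):
    """(1/b) * integral of K((x - y)/b) dF_n(y), written as a sum over KM jumps."""
    total = sum(a * kernel((x - t) / bandwidth) for t, a in km_jumps(z, delta))
    return total / bandwidth


def nn_bandwidth(x, z, delta, k):
    """Distance from x to its k-th nearest uncensored observation."""
    dists = sorted(abs(x - t) for t, d in zip(z, delta) if d == 1)
    if k > len(dists):
        raise ValueError("fewer than k uncensored observations")
    return dists[k - 1]


z, delta = [1.0, 2.0, 3.0], [1, 1, 1]
print(kernel_estimate(2.0, z, delta, bandwidth=2.0))
print(kernel_estimate(2.0, z, delta, bandwidth=nn_bandwidth(2.0, z, delta, k=2)))
```

Passing a fixed number as `bandwidth` corresponds to the form of (1.1), while passing the output of `nn_bandwidth` corresponds to the form of (1.2); without censoring every jump equals $1/n$ and the classical kernel estimator is recovered.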

We shall show that some properties of estimators (1.1)-(1.4) may be deduced from the properties of classic kernel estimators when the observations are not censored. The connections with the classical case are stated in Lemma 1 for $\hat f_n$ and $f_n$ and in Lemma 2 for $\hat f_n^*$ and $f_n^*$. The link is evident in the second case since then the uncensored observations may be treated as $n_1$ random variables distributed as $(Z \mid \delta = 1)$ where $(Z, \delta) \sim (Z_i, \delta_i)$ for any $i$. In the first case we consider

$\tilde F_n(y) = \dfrac{1}{n} \sum_{i=1}^{n} [Z_i \le y,\ \delta_i = 1],$

which, on any compact interval $[0, d]$ may be interpreted as the empirical distribution function of some random variable $W(d)$. In Section 3 some results on the consistency and weak convergence of the introduced estimators are proved. In particular it is shown that the asymptotic variances of $\hat f_n$ and $\hat f_n^*$ do not depend on censoring, as opposed to the asymptotic variances of $f_n$ and $f_n^*$. Finally, we state a (pointwise) mean consistency result for $\hat f_n^*$.


2. Classical analogues for $\hat f_n$, $f_n$, $\hat f_n^*$, and $f_n^*$. Put $p = P(\delta = 1)$ and $q = 1 - p$. Observe first that defining $R(n)$ as the distance to the $k(n)$th uncensored observation leaves $R(n)$ undefined on the set $A_n = \{k(n) > n_1\}$. However this has no influence on the asymptotic properties of $R(n)$: If $n_0$ is such that $n \ge n_0$ implies $np - k(n) > npq$ then

$\sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty} P(\mathrm{Bin}(n, p) < k(n))$

$\quad = \sum_{n=1}^{\infty} P(np - \mathrm{Bin}(n, p) > np - k(n))$

$\quad \le n_0 + \sum_{n=n_0}^{\infty} P(|\mathrm{Bin}(n, p) - np| > npq)$

$\quad \le n_0 + \sum_{n=n_0}^{\infty} 2\exp(-2npq/9) < \infty.$

(At the end of this argument Bernstein's inequality was used; cf. Rényi (1970).)

From now on we denote by $x$ a fixed point of $R^+$ such that $f(x)\bar G(x) > 0$ where $\bar G = 1 - G$. Let $W(x) = Z \cdot [\delta = 1] + (Z + x + 1) \cdot [\delta = 0]$ and let $W_i(x)$ for $i = 1, \ldots, n$ be defined in the same way as $W(x)$ with $Z$ and $\delta$ replaced by $Z_i$ and $\delta_i$, respectively. Obviously, $(W_1(x), \ldots, W_n(x))$ is an i.i.d. sequence with $W_i(x)$ distributed as $W(x)$ for any $i$. Let $\tilde R(n)$ be the distance from $x$ to its $k(n)$th nearest neighbor in the sequence $(W_1(x), \ldots, W_n(x))$. For any function $f$ denote by $C(f)$ the set of continuity points of $f$.
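
The effect of the shift by $x + 1$ can be seen in a small sketch: every censored observation is moved to a point at distance at least 1 from $x$, so whenever the $k$th nearest uncensored neighbor lies within distance 1, the nearest neighbor distance computed from the shifted sample agrees with the one computed from the uncensored observations alone. The function names and data below are hypothetical.

```python
def shifted_sample(x, z, delta):
    """W_i(x) = Z_i for uncensored points and Z_i + x + 1 for censored ones."""
    return [zi if di == 1 else zi + x + 1 for zi, di in zip(z, delta)]


def kth_nearest_distance(x, points, k):
    """Distance from x to its k-th nearest neighbour among `points`."""
    return sorted(abs(x - p) for p in points)[k - 1]


x, k = 1.0, 2
z = [0.9, 1.0, 1.2, 1.3, 2.0]
delta = [1, 0, 1, 1, 0]

uncensored = [zi for zi, di in zip(z, delta) if di == 1]
r_uncensored = kth_nearest_distance(x, uncensored, k)             # R(n)
r_shifted = kth_nearest_distance(x, shifted_sample(x, z, delta), k)
print(r_uncensored, r_shifted)
```

Here the censored observation at exactly $x = 1.0$ would be the nearest neighbor in the raw sample, but after the shift it sits at $3.0$ and no longer interferes.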


LEMMA 1. If $k(n)/\log\log n \to \infty$ and $x \in C(f \cdot \bar G)$ then

(i) $\tilde R(n) = R(n)$ for all but finitely many $n$ a.s.

(ii) $k(n)/2nR(n) \to f(x)\bar G(x)$ a.s.

PROOF. Using the results of Moore and Yackel (1976) we have

$\tilde R(n) \to 0$ a.s.

Thus the first $k(n)$ neighbors lie in a small neighborhood of $x$ and by the definition of $\tilde R(n)$ they must be uncensored. Since on $[0, x + 1)$ the random variable $W$ has a density which is continuous at the point $x$, assertion (ii) follows from (i) and Theorem 1 of Moore and Yackel (1976). □


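Note that $\tilde F_n(y)$ is the e.d.f. of $W(x)$ on $[0, x + 1)$ and thus we may estimate $f(x)\bar G(x)$ by means of

$\dfrac{1}{h(n)} \int K\left(\dfrac{x - y}{h(n)}\right) d\tilde F_n(y)$

or

$\dfrac{1}{R(n)} \int K\left(\dfrac{x - y}{R(n)}\right) d\tilde F_n(y)$

using a kernel $K$ which has a compact support.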




THEOREM 1. Let $K$ be a bounded density function with support in $[-1, 1]$. Assume that $x \in C(f \cdot \bar G) \cap C(g)$.

(i) If $k(n)/\log\log n \to \infty$ then, with probability 1,

$\hat f_n(x) - \dfrac{1}{R(n)\bar G(x)} \int K\left(\dfrac{x - y}{R(n)}\right) d\tilde F_n(y) = O((\log\log n/n)^{1/2}) + O(k(n)/n).$

(ii) If $nh(n)/\log\log n \to \infty$ then, with probability 1,

$f_n(x) - \dfrac{1}{h(n)\bar G(x)} \int K\left(\dfrac{x - y}{h(n)}\right) d\tilde F_n(y) = O((\log\log n/n)^{1/2}) + O(h(n)).$


PROOF OF (i). Let $S(x, r) = \{y: |y - x| \le r\}$. Then

$\left| \hat f_n(x) - \dfrac{1}{R(n)\bar G(x)} \int K\left(\dfrac{x - y}{R(n)}\right) d\tilde F_n(y) \right| \le \sup K \cdot \dfrac{k(n)}{R(n)} \max_{z_i \in S(x, R(n)),\ \delta_i = 1} \left| 1/n\bar G(x) - a_n(z_i) \right|,$

where $a_n(z_i)$ is the value of the jump of the KM estimator in $z_i$. Using (ii) of Lemma 1 it is enough to show that

(2.1) $\max_{z_i \in S(x, R(n)),\ \delta_i = 1} \left| 1/\bar G(x) - na_n(z_i) \right| = O((\log\log n/n)^{1/2}) + O\left(\dfrac{k(n)}{n}\right).$

Since $na_n(z_i) = \bar F_n(z_i - 0)/\bar H_n(z_i - 0)$ (Efron (1967)), where $\bar H_n$ is the empirical survival function for all observations, it follows that (2.1) is majorized by

$\sup_{t \le x + R(n)} \left| \dfrac{\bar F_n(t)}{\bar H_n(t)} - \dfrac{\bar F_n(t)}{\bar H(t)} \right| + \sup_{t \le x + R(n)} \left| \dfrac{\bar F_n(t)}{\bar H(t)} - \dfrac{\bar F(t)}{\bar H(t)} \right| + \sup_{|t - x| \le R(n)} \left| \dfrac{1}{\bar G(x)} - \dfrac{1}{\bar G(t)} \right|.$

The two first terms are $O((\log\log n/n)^{1/2})$ a.s. in view of the LIL for the Kolmogorov-Smirnov distance (cf. Serfling (1980)) and the result of Földes and Rejtö (1981a), respectively. Since the last term is majorized by $(G(x + R(n)) - G(x - R(n)))(\bar G(x + R(n)))^{-2}$ then (i) follows from

$(G(x + R(n)) - G(x - R(n)))/R(n) \to 2g(x)$ a.s.

and $nR(n)/k(n) = O(1)$ a.s. □


PROOF OF (ii). The proof follows the lines of (i) with the equality

(2.2) $\lim_{n \to \infty} \sum_{i=1}^{n} [Z_i \in S(x, h(n)),\ \delta_i = 1]/nh(n) = 2f(x) \cdot \bar G(x)$

used instead of (ii) of Lemma 1. Formula (2.2) is obvious in view of the strong consistency result for the classical kernel estimator with the kernel uniform on $[-1, 1]$ applied to $(W_1(x), \ldots, W_n(x))$. □


Let us turn to $f_n^*$ and $\hat f_n^*$. Consider $(Z \mid \delta = 1)$ instead of $W(x)$, and replace $W_1(x), \ldots, W_n(x)$ by the sequence of uncensored observations, denoted by $Z_{i_1}, \ldots, Z_{i_{n_1}}$. The density of $(Z \mid \delta = 1)$ is equal to $f_1(x) = f(x)\bar G(x)/p$.


LEMMA 2. Let $k$ be a fixed integer $\le n$. Conditionally on $n_1 = k$, $Z_{i_1}, \ldots, Z_{i_k}$ is an i.i.d. sequence with density $f_1$.


From Lemma 2, for $T_n(t) = n_1^{-1}\sum_{i=1}^{n} [Z_i \le t,\ \delta_i = 1]$ the estimators

$\dfrac{1}{h(n_1)} \int K\left(\dfrac{x - y}{h(n_1)}\right) dT_n(y)$ and $\dfrac{1}{R(n_1)} \int K\left(\dfrac{x - y}{R(n_1)}\right) dT_n(y)$

can be viewed as the classical kernel estimators for $f_1$, based on a sample of random size $n_1$. From Lemma 2 it is also easy to see that asymptotic properties of these estimators, such as convergence in probability and weak convergence, are the same as the respective asymptotic properties of analogous estimators based on $n$ observations from the distribution of $(Z \mid \delta = 1)$. Moreover, an exact analogue of Theorem 1 is true for $f_n^*$ and $\hat f_n^*$ and its proof is similar to that of Theorem 1. Theorem 1 and its analogue allow us to study the asymptotic properties of the proposed estimators.


3. Asymptotic properties of the proposed estimators. Below we list some properties of the estimators $f_n$ and $\hat f_n$. They rely upon analogous properties of their classical counterparts and, as for the consistency results, on the following theorem of Moore and Yackel (1977). Let $K$ satisfy the assumptions of Theorem 1 and the condition $K(cu) \ge K(u)$ for any $0 < c < 1$. For any fixed sequence $k(n)$ consider an arbitrary consistency result holding for the estimator with kernel $K$ and bandwidth $h(n) = k(n)/n$ and also for the estimator with the uniform kernel and bandwidth $h(n)$. Then this result holds for the nearest neighbor estimator with kernel $K$ and the bandwidth based on $k(n)$. The only qualification to this argument is that the conditions on $k(n)$ must also be satisfied by $\alpha k(n)$ for any $\alpha > 0$. We assume that the conditions of Theorem 1 are satisfied.


COROLLARY 1. (i) If $\sum_{n=1}^{\infty} \exp(-ck(n)) < +\infty$ for every $c > 0$, $K(cu) \ge K(u)$ for $0 < c < 1$, then

$\hat f_n(x) - f(x) \to 0$ a.s.

(ii) If $\sum_{n=1}^{\infty} \exp(-cnh(n)) < +\infty$ for every $c > 0$, then

$f_n(x) - f(x) \to 0$ a.s.


Corollary 1(ii) is an analogue of Theorem 1 in Devroye and Wagner (1979) in the case of censored data.


If $K$ is the uniform kernel on $[-1, 1]$ then $f_n(x)$ and $\hat f_n(x)$ are strongly consistent under the assumptions of Theorem 1.


COROLLARY 2. Let $g$ be continuous and let $f \cdot \bar G$ be continuous and positive. Assume that $\bar H(T) > 0$.

(i) Suppose that $K$ is continuous and $K(cu) \ge K(u)$ for $0 < c < 1$. If $k(n)$ is a sequence of integers such that $k(n)/\log n \to +\infty$ then, with probability 1,

(3.1) $\lim_{n \to \infty} \sup_{0 \le x \le T} |\hat f_n(x) - f(x)| = 0.$

(ii) If $K$ is a continuous kernel then (3.1) holds with $\hat f_n$ replaced by $f_n$.


Corollary 2(ii) is a censored data version of Theorem A in Silverman (1978). The proof of Corollary 2 relies on the fact that $\tilde F_n(y)$ is the e.d.f. of $W(T)$ on $[0, T + 1)$.


PROOF OF (i). By the aforementioned theorem of Silverman (1978) it is enough to show that strong convergence in Theorem 1 can be replaced by uniform strong convergence on $[0, T]$. To see this observe that since $k(n)/\log n \to \infty$ and $f \cdot \bar G$ is uniformly continuous on $[0, T]$, in view of Theorem 1 in Devroye and Wagner (1977) we have

(3.2) $\sup_{0 \le x \le T} |k(n)/nR(n, x) - 2f(x)\bar G(x)| \to 0$ a.s.

It remains to consider the last term of the majorant occurring in the proof of Theorem 1 and to show that

$\sup_{0 \le x \le T} (G(x + R(n)) - G(x - R(n))) = O(k(n)/n)$ a.s.

We have

$\sup_{0 \le x \le T} (G(x + R(n)) - G(x - R(n))) = \sup_{0 \le x \le T} \dfrac{k(n)}{n} \cdot \dfrac{nR(n)}{k(n)} \cdot \dfrac{G(x + R(n)) - G(x - R(n))}{R(n)}.$

Since $\sup R(n)$ on $[0, T]$ tends to 0 a.s. and $g$ is uniformly continuous we have

$\sup_{0 \le x \le T} |(G(x + R(n)) - G(x - R(n)))/R(n) - 2g(x)| \to 0$ a.s.

Thus the proof of (i) is completed in view of (3.2) and the fact that $f \cdot \bar G$ is positive on $[0, T]$. □


The proof of (ii) is similar.


REMARK. Observe that the uniform strong convergence of $f_n$ on $[0, T]$ is obtained, with the stronger condition on $h(n)$: $\sum \exp(-cnh^2(n)) < +\infty$ for all positive $c$ and with $K$ of bounded variation, using the result of Nadaraya (1965)




and the inequality (Földes and Rejtö (1981a)) 
P| sup |F (x) — F(x)|> e) < dẹæxp( —ne*S‘d,), 
_ O<sxsT 
where H(T) > ô > 0 and e > 2'/nd*, dọ, d, being universal constants. 


COROLLARY 3. (i) Assume that $f \cdot \bar G$ has a bounded derivative in a neighborhood of $x$. If $k(n) = o(n^{2/3})$ then

(3.3) $(k(n))^{1/2}(\hat f_n(x) - f(x)) \to_{\mathscr D} N\left(0,\ 2f^2(x) \int K^2(y)\,dy\right).$

(ii) Assume that $K$ is an even function, $f$ has a second derivative which is bounded in a neighborhood of $x$, and $h(n) = o(n^{-1/3})$. Then

(3.4) $(nh(n))^{1/2}(f_n(x) - f(x)) \to_{\mathscr D} N\left(0,\ (f(x)/\bar G(x)) \int K^2(y)\,dy\right).$


The corresponding uncensored data theorems are given in Moore and Yackel (1976) and in Rosenblatt (1971).


PROOF OF (i). Observe that for $w_n(x) = (1/R(n)) \int K((x - y)/R(n))\,d\tilde F_n(y)$ we have

(3.5) $(k(n))^{1/2}(w_n(x) - f(x)\bar G(x)) \to_{\mathscr D} N\left(0,\ 2(f(x)\bar G(x))^2 \int K^2(y)\,dy\right)$

(Moore and Yackel (1976)). Since $k(n) = o(n^{2/3})$ implies $[k(n)]^{1/2} = o(n/k(n))$, (i) follows from (3.5) and Theorem 1. □


PROOF OF (ii). Rosenblatt (1971) proved that under the conditions imposed on $K$ in (ii) and $h(n) = o(n^{-1/5})$, $w_n(x) = (1/h(n)) \int K((x - y)/h(n))\,d\tilde F_n(y)$ is asymptotically normal with mean $f(x)\bar G(x)$ and asymptotic variance $(1/(nh(n)))\,f(x)\bar G(x) \int K^2(y)\,dy$. The result follows from the fact that $(nh(n))^{1/2} = o(1/h(n))$ for $h(n) = o(n^{-1/3})$. □


Observe that (i) the asymptotic variance of $\hat f_n$ does not depend on censoring and (ii) analogues of Corollaries 1, 2, and 3 for $f_n^*$ and $\hat f_n^*$ are also true. The only difference is that in Corollary 3 the scaling sequences $(nh(n))^{1/2}$ and $(k(n))^{1/2}$ are replaced by $(n_1h(n_1))^{1/2}$ and $(k(n_1))^{1/2}$, respectively.

We also state (Mielniczuk (1985)): 


THEOREM 2. Assume that conditions of Corollary 1(i) are satisfied. Suppose that $\log n \cdot k(n)/n \to 0$, $f_1$ is a bounded density function in a neighborhood of $x$, and $x$ satisfies $f(x)\bar G(x) > 0$. Then

$\int |\hat f_n^*(x) - f(x)|\,dP \to 0.$

This theorem is a censored data version of Theorem 4 of Moore and Yackel (1976). Basically, the proof of Theorem 2 is parallel to the proof of the corresponding theorem.


KERNEL DENSITY ESTIMATOR 773 
Acknowledgment. I thank J. Koronacki for his comments. 


REFERENCES 


BLUM, J. R. and SUSARLA, V. (1980). Maximal deviation theory of density and failure rate function
estimates based on censored data. In Multivariate Analysis 5 (P. R. Krishnaiah, ed.)
213-222. North-Holland, New York.

DEVROYE, L. P. and WAGNER, T. J. (1977). The strong uniform consistency of nearest neighbor
density estimates. Ann. Statist. 5 536-540.

DEVROYE, L. P. and WAGNER, T. J. (1979). The $L^1$ convergence of kernel density estimates. Ann.
Statist. 7 1136-1139.

EFRON, B. (1967). The two-sample problem with censored data. Proc. Fifth Berkeley Symp. Math.
Statist. Prob. 4 831-853.

FÖLDES, A. and REJTÖ, L. (1981a). A LIL type result for the product limit estimator. Z. Wahrsch.
verw. Gebiete 56 75-86.

FÖLDES, A., REJTÖ, L. and WINTER, B. B. (1981b). Strong consistency properties of nonparametric
estimators for randomly censored data, II: Estimation of density and failure rate. Period.
Math. Hungar. 12 15-29.

KAPLAN, E. L. and MEIER, P. (1958). Nonparametric estimation from incomplete observations. J.
Amer. Statist. Assoc. 53 457-481.

MACK, Y. P. (1980). Asymptotic normality of multivariate k-NN density estimates. Sankhya Ser. A
42 53-63.

MACK, Y. P. and ROSENBLATT, M. (1979). Multivariate k-nearest neighbor density estimates. J.
Multivariate Anal. 9 1-15.

MCNICHOLS, D. T. and PADGETT, W. J. (1984a). A modified kernel estimator for randomly right
censored data. South African Statist. J. 18 13-27.

MCNICHOLS, D. T. and PADGETT, W. J. (1984b). Nonparametric density estimation from censored
data. Comm. Statist. A-Theory Methods 13 1581-1611.

MIELNICZUK, J. (1985). Kernel estimators of a density function in case of censored data. ICS PAS
Report No. 560, Warsaw.

MOORE, D. S. and YACKEL, J. W. (1976). Large sample properties of nearest neighbor density
function estimators. In Statistical Decision Theory and Related Topics (S. S. Gupta and
D. S. Moore, eds.) 269-279. Academic, New York.

MOORE, D. S. and YACKEL, J. W. (1977). Consistency properties of nearest neighbor density
estimates. Ann. Statist. 5 143-154.

NADARAYA, E. A. (1965). On nonparametric estimates of density functions and regression curves.
Theor. Probab. Appl. 10 186-190.

RÉNYI, A. (1970). Probability Theory. Akadémiai Kiadó, Budapest.

ROSENBLATT, M. (1971). Curve estimates. Ann. Math. Statist. 42 1815-1842.

SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

SILVERMAN, B. W. (1978). Weak and strong uniform consistency of the kernel estimates of a density
and its derivatives. Ann. Statist. 6 177-184.

TANNER, M. A. (1983). A note on the variable kernel estimator of the hazard function from randomly
censored data. Ann. Statist. 11 994-997.


INSTITUTE OF COMPUTER SCIENCE 
POLISH ACADEMY OF SCIENCES 
P.O. Box 22

PL-00-901 WARSAW, PKIN 
POLAND 


The Annals of Statistics
1986, Vol. 14, No. 2, 774-780


A LARGE DEVIATION RESULT FOR SIGNED LINEAR RANK 
STATISTICS UNDER THE SYMMETRY HYPOTHESIS 


BY TIEE-JIAN WU
University of Houston 


A Cramér type large deviation theorem for signed linear rank statistics
under the symmetry hypothesis is obtained. The theorem is proved for a wide
class of scores covering most of the commonly used ones (including the normal
scores). Furthermore, the optimal range $0 < x < o(n^{1/4})$ can be obtained for
bounded scores, whereas the range $0 < x < o(n^{\delta})$, $\delta \in (0, 1/4)$, is obtainable for
many unbounded ones. This improves the earlier result under the symmetry
hypothesis in Seoh, Ralescu, and Puri (1985).


1. Introduction and statement of the main theorem. For $n \ge 1$ let $X_{ni}$,
$i = 1, \ldots, n$, be independent and identically distributed random variables
distributed according to the cumulative distribution function $F$. We assume that

(1.1) $F$ is continuous and symmetric about zero.

Let $R_{ni}$ be the rank of $|X_{ni}|$ among $|X_{n1}|, \ldots, |X_{nn}|$. We shall consider the signed
linear rank statistics of the form

$$S_n = \sum_{i=1}^{n} c_{ni}\, a_n(R_{ni})\, \mathrm{sgn}(X_{ni}), \qquad n = 1, 2, \ldots, \tag{1.2}$$

where $c_{n1}, \ldots, c_{nn}$ are known constants, $a_n(1), \ldots, a_n(n)$ are known real numbers
(called scores), and $\mathrm{sgn}(x) = 1$ or $-1$ according as $x > 0$ or $< 0$. Under suitable
assumptions on the $c_{ni}$'s and $a_n(i)$'s, the asymptotic normality of $S_n$ has been
established [Hušková (1970) and Hájek and Šidák (1967)]. Recently, Seoh,
Ralescu, and Puri (1985) obtained a Cramér type large deviation theorem with
range $0 < x < o(n^{1/6})$ for the statistic $S_n$ with bounded scores (in fact, they
considered the so-called generalized rank statistics, which include $S_n$ as a special
case). The purpose of this paper is twofold. In the first place we extend their
assertion on the large deviation probabilities for $S_n$ under the symmetry
hypothesis to a wide class of scores covering unbounded ones (including the normal
scores). Secondly, we show that under the symmetry hypothesis the optimal
range $0 < x < o(n^{1/4})$ can be obtained for bounded scores. It should be remarked
that in case of symmetry the optimal range equals $0 < x < o(n^{1/4})$, while in
general the optimal range is $0 < x < o(n^{1/6})$ [cf. Feller (1971), page 553]. We also
note that Kallenberg (1982) studied the same problem in the case of (unsigned)
simple linear rank statistics with bounded scores.
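For readers who prefer a computational view, the statistic in (1.2) can be sketched as follows. This is a minimal illustration only: the function name and the Wilcoxon example (c_{ni} = 1, a_n(i) = i) are our own choices, and ties or zeros among the observations are assumed away, as they have probability zero under (1.1).

```python
def signed_linear_rank_statistic(x, c, scores):
    """Return S_n = sum_i c[i] * a_n(R_i) * sgn(x[i]) as in (1.2).

    x      : observations (assumed nonzero with distinct absolute values)
    c      : regression constants c_{n1}, ..., c_{nn}
    scores : a_n(1), ..., a_n(n), stored 0-based (hypothetical layout)
    """
    n = len(x)
    # order[r-1] is the index of the observation whose |x| has rank r
    order = sorted(range(n), key=lambda i: abs(x[i]))
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return sum(
        c[i] * scores[rank[i] - 1] * (1 if x[i] > 0 else -1)
        for i in range(n)
    )


# Wilcoxon signed rank statistic: c_{ni} = 1 and a_n(i) = i.
x = [0.3, -1.2, 0.7, 0.1]
s = signed_linear_rank_statistic(x, [1, 1, 1, 1], [1, 2, 3, 4])
print(s)  # ranks of |x| are 2, 4, 3, 1, so S_n = 2 - 4 + 3 + 1 = 2
```

Other score choices, such as the normal scores, only change the `scores` argument.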


Received January 1985; revised July 1985. 

AMS 1980 subject classifications. Primary 60F10; secondary 6220. 

Key words and phrases. Signed linear rank statistics, score generating function, large deviation 
probabilities. 




LARGE DEVIATIONS FOR RANK STATISTICS 775 


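Throughout the paper we make the following assumptions ($n_0$ is some positive
integer):

$$\bar c_n \ne 0, \qquad \max_{1 \le i \le n} |c_{ni} - \bar c_n| \le A_1 |\bar c_n|\, n^{-\delta_1}, \qquad n \ge n_0, \tag{1.3}$$

where $A_1 > 0$ is an absolute constant, $\delta_1 > 0$, and $\bar c_n = n^{-1}\sum_{i=1}^{n} c_{ni}$;

$$\max_{1 \le i \le n} |a_n(i)| > 0, \quad n \ge n_0; \qquad \Bigl(\max_{1 \le i \le n} |a_n(i)|\Bigr)\Bigl(\sum_{i=1}^{n} a_n^2(i)\Bigr)^{-1/2} \to 0 \tag{1.4}$$

as $n \to \infty$.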


REMARK 1.1. We give two examples of constants $c_{ni}$ satisfying (1.3):


1. $c_{n1} \ne 0$ and $c_{ni} = c_{n1}$ for all $i = 1, \ldots, n$ and $n = 1, 2, \ldots$;
2. $|\bar c_n| \ge n^{-\alpha}$, $\alpha \ge 0$, $\max_{1 \le i \le n} |c_{ni} - \bar c_n| = O(n^{-\alpha - \delta_1})$.


REMARK 1.2. Note that (1.4) is the only assumption we are making on the
scores. However, they usually are generated by a real-valued Borel measurable
function $\phi(u)$, $0 < u < 1$, in either one of the following two ways:

$$a_n(i) = \phi(i/(n+1)), \qquad i = 1, \ldots, n, \tag{1.5}$$

$$a_n(i) = E\phi(U_n^{(i)}), \qquad i = 1, \ldots, n, \tag{1.6}$$

where $U_n^{(i)}$ denotes the $i$th order statistic in a random sample of size $n$ from a
uniform distribution over $(0,1)$. Now, suppose the score generating function $\phi$
satisfies:

(1.7) the set $\{u: \phi(u) \ne 0\}$ has positive Lebesgue measure and $\phi$ can be
expressed as a finite linear combination of monotone functions
$\{\phi_1, \ldots, \phi_k\}$ with $|\phi_j(u)| \le M[u(1-u)]^{-1/2+\delta_2}$ for every
$j = 1, \ldots, k$ and $u \in (0,1)$, where $M$ is a positive constant and
$0 < \delta_2 \le 1/2$.

By Theorem V.1.4a and Lemma V.1.6a of Hájek and Šidák (1967), we obtain

$$0 < \lim_{n \to \infty} n^{-1}\sum_{i=1}^{n} a_n^2(i) = \|\phi\|_2^2 < \infty, \qquad \max_{1 \le i \le n} |a_n(i)| = O(n^{1/2-\delta_2}) \tag{1.8}$$

for both the cases (1.5) and (1.6). Since (1.8) clearly implies (1.4), any score
generating function satisfying (1.7) generates scores satisfying (1.4). It may be
noted that, unlike in the earlier papers referred to above, where bounded derivatives
of different orders of $\phi$ are assumed, here (1.7) is the only assumption on $\phi$ we need.


776 T -J. WU 


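Let $\Phi$ denote the standard normal distribution function. We denote for
$n = 1, 2, \ldots$

$$\lambda_n = \Bigl(\max_{1 \le i \le n} |a_n(i)|\Bigr)\Bigl(\sum_{i=1}^{n} a_n^2(i)\Bigr)^{-1/2}, \tag{1.9}$$

$$\tau_n^2 = \bar c_n^{\,2} \sum_{i=1}^{n} a_n^2(i), \tag{1.10}$$

$$\sigma_n^2 = n^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ni}^2\, a_n^2(j), \tag{1.11}$$

$$b_n = \min(\lambda_n^{-1/2},\, n^{\delta_1/2}). \tag{1.12}$$

Note that $b_n$ depends only on the scores and $\delta_1$, the magnitude of the latter
depending on that of $\max_{1 \le i \le n} |c_{ni}\bar c_n^{-1} - 1|$ [see (1.3)]. Obviously, (1.4) and (1.9)
imply $\lambda_n^{-1/2} \to \infty$ and $\lambda_n^{-1/2} \le n^{1/4}$. Thus

$$b_n \to \infty \quad \text{and} \quad b_n \le \min(n^{1/4},\, n^{\delta_1/2}). \tag{1.13}$$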
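The quantities in (1.9)-(1.12) are straightforward to evaluate numerically. The sketch below does so for van der Waerden scores generated as in (1.5); the function names, the choice c_{ni} = 1, and the value delta1 = 0.5 are illustrative assumptions of ours, not part of the paper.

```python
import math
from statistics import NormalDist


def van_der_waerden_scores(n):
    """Scores a_n(i) = phi(i/(n+1)) with phi(u) = Phi^{-1}((u+1)/2), as in (1.5)."""
    nd = NormalDist()
    return [nd.inv_cdf((i / (n + 1) + 1) / 2) for i in range(1, n + 1)]


def norming_quantities(c, a, delta1):
    """Return (lambda_n, tau_n^2, sigma_n^2, b_n) following (1.9)-(1.12)."""
    n = len(a)
    c_bar = sum(c) / n
    sum_a2 = sum(v * v for v in a)
    lam = max(abs(v) for v in a) / math.sqrt(sum_a2)   # (1.9)
    tau2 = c_bar ** 2 * sum_a2                          # (1.10)
    sigma2 = sum(ci * ci for ci in c) * sum_a2 / n      # (1.11)
    b = min(lam ** -0.5, n ** (delta1 / 2))             # (1.12)
    return lam, tau2, sigma2, b


n = 1000
a = van_der_waerden_scores(n)
lam, tau2, sigma2, b = norming_quantities([1.0] * n, a, delta1=0.5)
print(lam, b, n ** 0.25)  # b_n never exceeds n^{1/4}, in line with (1.13)
```

With constant c_{ni} the two variances coincide, which is the situation of the first example in Remark 1.1.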


The main result of the paper is the following: 


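THEOREM 1.1. Under assumptions (1.1) and (1.3)-(1.4), we have as $n \to \infty$
that

$$\sup_{x \in I_n} \bigl|(1 - F_n(x))(1 - \Phi(x))^{-1} - 1\bigr| \to 0, \tag{1.14}$$

$$\sup_{x \in I_n} \bigl|(1 - G_n(x))(1 - \Phi(x))^{-1} - 1\bigr| \to 0, \tag{1.15}$$

where $F_n$ and $G_n$ are the cdf's of $S_n\tau_n^{-1}$ and $S_n\sigma_n^{-1}$, respectively, $I_n$ denotes the
interval $(0, \rho_n b_n]$, and $\rho_n$, $n \ge 1$, is an arbitrary sequence of positive real
numbers with $\lim_{n \to \infty} \rho_n = 0$.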

REMARK 1.3. From (1.13), we get $I_n \subset (0, \rho_n n^{1/4}] \cap (0, \rho_n n^{\delta_1/2}]$. Let us
consider the case that the scores are generated by $\phi$ according to either (1.5) or (1.6)
with $\phi$ satisfying (1.7). We obtain from (1.8)-(1.9) that $A_2 n^{\delta_2/2} \le \lambda_n^{-1/2}$ for all
sufficiently large $n$, where $A_2 > 0$ is a constant (independent of $n$). Thus the
range $I_n$ covers the range $0 < x < o(n^{\delta})$, $\delta = \min(\delta_1/2, \delta_2/2)$. [Note that for
unbounded $\phi$ we have $\delta_2 < 1/2$ by (1.7), hence $\delta < 1/4$ in this case.] For example, the
range $0 < x < o(n^{1/6+\delta'})$, $0 < \delta' < 1/12$, is obtained when $\delta_1 = 1/3 + 2\delta'$ and
$\delta_2 \ge 1/3 + 2\delta'$ [in this case, $\max_{1 \le i \le n} |c_{ni}\bar c_n^{-1} - 1| = O(n^{-1/3-2\delta'})$ and $\phi$ can be
unbounded]. The widest range $0 < x < o(n^{1/4})$ is obtained if and only if $\delta_1 = \delta_2 = 1/2$
[in this case, $\max_{1 \le i \le n} |c_{ni}\bar c_n^{-1} - 1| = O(n^{-1/2})$ and $\phi$ is bounded]. For the one
sample normal scores test or van der Waerden's test [with $\phi(u) = \Phi^{-1}((u + 1)/2)$
in (1.6) or (1.5), respectively] it holds that $(\log n)^{1/2} \le \max_{1 \le i \le n} |a_n(i)| =
a_n(n) \le (2\log n)^{1/2}$ for all sufficiently large $n$, which can be seen from Lemma
VII.1.2 of Feller (1968) and from Section 4.4 of David (1981). It follows from (1.8),
(1.9), and (1.12) that for both tests the widest possible range is $0 < x <
o(n^{1/4}(\log n)^{-1/4})$ (when $\delta_1 = 1/2$).
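The last claim of the remark can be checked in two lines. The display below is our own reading of the argument (the symbol $\asymp$, meaning bounded above and below by constant multiples, is not used in the paper); it relies only on (1.8), (1.9) and the bounds on $a_n(n)$ quoted above.

```latex
\begin{align*}
\lambda_n
  &= \frac{\max_{1 \le i \le n} |a_n(i)|}{\bigl(\sum_{i=1}^{n} a_n^2(i)\bigr)^{1/2}}
   \asymp \frac{(\log n)^{1/2}}{n^{1/2}}
  && \text{since } n^{-1}\sum_{i=1}^{n} a_n^2(i) \to \|\phi\|_2^2 \in (0, \infty), \\
\lambda_n^{-1/2}
  &\asymp n^{1/4} (\log n)^{-1/4},
  && \text{so } b_n \asymp n^{1/4}(\log n)^{-1/4} \text{ when } \delta_1 = 1/2 .
\end{align*}
```

Since $\rho_n \to 0$ is arbitrary, the interval $(0, \rho_n b_n]$ then covers every range of the form $0 < x < o(n^{1/4}(\log n)^{-1/4})$.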


To prove the theorem, we shall approximate $S_n$ by the statistic

$$T_n = \bar c_n \sum_{i=1}^{n} a_n(R_{ni})\, \mathrm{sgn}(X_{ni}). \tag{1.16}$$

Let $D = (D_{n1}, \ldots, D_{nn})$ be the vector of antiranks associated with $R =
(R_{n1}, \ldots, R_{nn})$. Then $T_n$ is equivalently expressible [in its dual form to (1.16)] as

$$T_n = \bar c_n \sum_{i=1}^{n} a_n(i)\, \mathrm{sgn}(X_{nD_{ni}}). \tag{1.17}$$

But $\mathrm{sgn}(X_{nD_{n1}}), \ldots, \mathrm{sgn}(X_{nD_{nn}})$ are independent and identically distributed r.v.'s
under assumption (1.1) with common symmetric Bernoulli distribution [see
Theorem 19C of Hájek (1969)]. Therefore $T_n$ is actually expressible as a sum of
discrete independent r.v.'s. A Cramér type large deviation theorem is applied to
$T_n$, whereas a multinomial expansion is used to estimate the distance
$E|S_n - T_n|^{2p}$ for any $p \in [1, n]$. In the sequel we suppress the index $n$ whenever
it is possible.
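The equivalence of the rank form (1.16) and the antirank form (1.17) is a pure reindexing, which the following sketch checks numerically. The helper names, the Gaussian sample, and the particular scores are illustrative assumptions of ours.

```python
import math
import random


def sgn(v):
    return 1 if v > 0 else -1


def primal_and_dual_T(x, c_bar, scores):
    """Compute T_n via ranks, as in (1.16), and via antiranks, as in (1.17)."""
    n = len(x)
    # antirank[r-1] = D_r: index of the observation whose |x| has rank r
    antirank = sorted(range(n), key=lambda i: abs(x[i]))
    rank = [0] * n
    for r, i in enumerate(antirank, start=1):
        rank[i] = r
    primal = c_bar * sum(scores[rank[i] - 1] * sgn(x[i]) for i in range(n))
    dual = c_bar * sum(scores[r - 1] * sgn(x[antirank[r - 1]]) for r in range(1, n + 1))
    return primal, dual


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(50)]
scores = [i / 51 for i in range(1, 51)]
primal, dual = primal_and_dual_T(x, 2.0, scores)
print(primal, dual)  # the two forms agree up to floating point rounding
```

In the dual form each score a_n(i) is multiplied by its own sign variable, which is what makes the sum amenable to classical results for independent summands.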


2. Some lemmas and the proof of the main theorem. The following 
lemma deals with the large deviation probabilities of T,. 


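LEMMA 2.1. Under the assumptions of Theorem 1.1, it holds true as $n \to \infty$
that

$$\sup_{x \in I_n} \bigl|(1 - H_n(x_n))(1 - \Phi(x))^{-1} - 1\bigr| \to 0,$$

where $|x_n - x| = b_n^{-1}$, and $H_n$ denotes the cdf of $T_n\tau_n^{-1}$.


PROOF. From (1.1), (1.3), (1.9)-(1.10), and (1.17), we get $\mathrm{Var}\, T_n = \tau_n^2$ and for
all $n$

$$|\bar c_n a_n(i)\, \mathrm{sgn}(X_{nD_{ni}})| \le \lambda_n \tau_n, \qquad i = 1, \ldots, n. \tag{2.1}$$

It then follows from (1.4), (1.9), (1.12), (2.1), and from Theorem 1 of Feller (1943)
that there exists $n_1 > 0$ such that for all $n \ge n_1$ and $x \in (b_n^{-1}, \rho_n b_n]$

$$1 - H_n(x_n) = \exp\bigl[-2^{-1}x_n^2 Q_n(x_n)\bigr]\bigl\{1 - \Phi(x_n) + \theta_n \lambda_n \exp(-2^{-1}x_n^2)\bigr\}, \tag{2.2}$$

where

$$|\theta_n| < 9, \qquad Q_n(x) = \sum_{\nu=1}^{\infty} q_{n\nu} x^{\nu}, \qquad q_{n1} = 0, \qquad |q_{n\nu}| < 7^{-1}(12\lambda_n)^{\nu}, \quad \nu = 2, 3, \ldots. \tag{2.3}$$

Note that $q_{n1} = 0$ because the third moment of $\bar c_n a_n(i)\, \mathrm{sgn}(X_{nD_{ni}})$ vanishes for all
$i = 1, \ldots, n$ under assumption (1.1). Clearly (2.3) implies for all sufficiently large
$n$ that

$$|x_n^2 Q_n(x_n)| \le 7^{-1}(144)(\lambda_n x_n^2)^2 (1 - 12\lambda_n x_n)^{-1}, \tag{2.4}$$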




which converges to zero as $n \to \infty$ uniformly in $x \in (b_n^{-1}, \rho_n b_n]$. Now, by Lemma
VII.1.2 of Feller (1968), it can be readily seen that uniformly in $x \in (b_n^{-1}, \rho_n b_n]$

$$\lambda_n \exp(-2^{-1}x_n^2)(1 - \Phi(x))^{-1} \to 0, \tag{2.5}$$

$$(1 - \Phi(x_n))(1 - \Phi(x))^{-1} \to 1 \tag{2.6}$$

as $n \to \infty$. Combining (2.2)-(2.6) yields

$$\bigl|(1 - H_n(x_n))(1 - \Phi(x))^{-1} - 1\bigr| \to 0 \tag{2.7}$$

uniformly in $x \in (b_n^{-1}, \rho_n b_n]$. Next, by (1.1), (1.4), (1.17), and by Theorem V.1.2
of Hájek and Šidák (1967), we have $\|H_n - \Phi\|_\infty \to 0$, which implies uniformly in
$x \in (0, b_n^{-1}]$

$$\bigl|(1 - H_n(x_n))(1 - \Phi(x))^{-1} - 1\bigr| \to 0 \tag{2.8}$$

as $n \to \infty$. The proof follows from (2.7) and (2.8) immediately. □
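The step from (2.3) to (2.4) is a geometric series bound. The following lines are our own expansion of that step, under the reading that $x_n > 0$ and $12\lambda_n x_n < 1$ (which holds for large $n$ on the range considered, because $b_n^2 \le \lambda_n^{-1}$ by (1.12)).

```latex
\begin{align*}
|x_n^2 Q_n(x_n)|
  &\le x_n^2 \sum_{\nu=2}^{\infty} |q_{n\nu}|\, x_n^{\nu}
   \le 7^{-1} x_n^2 \sum_{\nu=2}^{\infty} (12\lambda_n x_n)^{\nu} \\
  &= 7^{-1} x_n^2\, \frac{(12\lambda_n x_n)^2}{1 - 12\lambda_n x_n}
   = 7^{-1}(144)(\lambda_n x_n^2)^2 (1 - 12\lambda_n x_n)^{-1}, \\
\lambda_n x_n^2
  &\le \lambda_n (\rho_n b_n + b_n^{-1})^2
   = \lambda_n b_n^2 (\rho_n + b_n^{-2})^2
   \le (\rho_n + b_n^{-2})^2 \to 0 .
\end{align*}
```

The same bound gives $\lambda_n x_n \le \lambda_n^{1/2}(\rho_n + b_n^{-2}) \to 0$, so the factor $(1 - 12\lambda_n x_n)^{-1}$ stays bounded.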
The following lemma gives us an upper bound for the distance $E|S_n - T_n|^{2p}$.


LEMMA 2.2. Under the assumptions of Theorem 1.1, for all $n \ge n_0$ and real
$p \in [1, n]$,

$$E|S_n - T_n|^{2p} \le A_3^{\,p}\, p^{\,p}\, n^{-2p\delta_1}\, \tau_n^{2p}, \tag{2.9}$$

where $A_3 > 0$ is an absolute constant.


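PROOF. By Hölder's inequality, it is sufficient to prove (2.9) only for $p =
1, 2, \ldots, n$. The dual form of $S_n$ is $S_n = \sum_{i=1}^{n} c_{D_i} a_n(i)\, \mathrm{sgn}(X_{D_i})$. Thus, in view
of (1.17), we get $S_n - T_n = \sum_{i=1}^{n} a_n(i)(c_{D_i} - \bar c_n)\, \mathrm{sgn}(X_{D_i})$. Furthermore,
let $\{p_1, \ldots, p_n\}$ be an arbitrary collection of nonnegative integers containing
at least one odd number; then (1.1) implies $E(\prod_{i=1}^{n} W_i^{p_i}) = 0$, where
$W_i = (c_{D_i} - \bar c_n)\, \mathrm{sgn}(X_{D_i})$. It follows, by using the multinomial expansion, that for
any $p = 1, \ldots, n$

$$E(S_n - T_n)^{2p} = \sum_{m=1}^{p}\ \sum_{\mathbf{i}_m \in A(m)}\ \sum_{\mathbf{p}_m \in B(m)} C^{2p}_{2\mathbf{p}_m} \prod_{j=1}^{m} a_n^{2p_j}(i_j)\; E\Bigl(\prod_{j=1}^{m} W_{i_j}^{2p_j}\Bigr), \tag{2.10}$$

where

$$\mathbf{i}_m = (i_1, \ldots, i_m), \qquad \mathbf{p}_m = (p_1, \ldots, p_m),$$

$$A(m) = \{\mathbf{i}_m: 1 \le i_1 < \cdots < i_m \le n\},$$

$$B(m) = \Bigl\{\mathbf{p}_m: \sum_{j=1}^{m} p_j = p;\ p_i = 1, \ldots, p \text{ for each } i\Bigr\}$$

and

$$C^{2p}_{2\mathbf{p}_m} = (2p)!\,\bigl((2p_1)! \cdots (2p_m)!\bigr)^{-1}.$$

Using the multinomial expansion again, we have

$$\Bigl(\sum_{i=1}^{n} a_n^2(i)\Bigr)^{p} = \sum_{m=1}^{p}\ \sum_{\mathbf{i}_m \in A(m)}\ \sum_{\mathbf{p}_m \in B(m)} C^{p}_{\mathbf{p}_m} \prod_{j=1}^{m} a_n^{2p_j}(i_j), \tag{2.11}$$

where $C^{p}_{\mathbf{p}_m} = p!\,\bigl((p_1!) \cdots (p_m!)\bigr)^{-1}$. Note also that

$$C^{2p}_{2\mathbf{p}_m} \le (2p)^{p}\, C^{p}_{\mathbf{p}_m}, \qquad p = 1, \ldots, n, \quad \mathbf{p}_m \in B(m). \tag{2.12}$$

The proof follows from (1.3), (1.10), and (2.10)-(2.12) quickly. □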
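The closing step of the proof is left to the reader in the paper. One way to fill it in, for integer $p$, is sketched below; the explicit constant $2A_1^2$ is our own computation and is not stated in the source.

```latex
\begin{align*}
|W_i| &\le \max_{1 \le i \le n} |c_{ni} - \bar c_n| \le A_1 |\bar c_n|\, n^{-\delta_1}
  \quad\Longrightarrow\quad
  E\Bigl(\prod_{j=1}^{m} W_{i_j}^{2p_j}\Bigr) \le \bigl(A_1 |\bar c_n|\, n^{-\delta_1}\bigr)^{2p}
  \quad \Bigl(\textstyle\sum_j p_j = p\Bigr), \\
E(S_n - T_n)^{2p}
  &\le (2p)^{p} \bigl(A_1 |\bar c_n|\, n^{-\delta_1}\bigr)^{2p}
     \sum_{m}\sum_{\mathbf{i}_m}\sum_{\mathbf{p}_m} C^{p}_{\mathbf{p}_m} \prod_{j=1}^{m} a_n^{2p_j}(i_j)
  && \text{by (2.10), (2.12)} \\
  &= (2p)^{p} A_1^{2p}\, n^{-2p\delta_1} \Bigl(\bar c_n^{\,2} \sum_{i=1}^{n} a_n^2(i)\Bigr)^{p}
   = (2A_1^2)^{p}\, p^{p}\, n^{-2p\delta_1}\, \tau_n^{2p}
  && \text{by (2.11), (1.10)}.
\end{align*}
```

Inequality (2.12) itself follows from $(2p)!/p! = (p+1)\cdots(2p) \le (2p)^p$ together with $(2p_j)! \ge p_j!$ for each $j$.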


PROOF OF THEOREM 1.1. We get by standard arguments that

$$-Q_n + \bigl(1 - H_n(x + b_n^{-1})\bigr) \le 1 - F_n(x) \le \bigl(1 - H_n(x - b_n^{-1})\bigr) + Q_n, \tag{2.13}$$

where $Q_n = P[|S_n - T_n| > b_n^{-1}\tau_n]$. Put $p_n = A_3^{-1}e^{-1}b_n^2$. Then (1.13) implies that
$p_n \in [1, n]$ for all sufficiently large $n$. It follows from Markov's inequality,
Lemma 2.2, and (1.12) that

$$Q_n \le (A_3 p_n b_n^2 n^{-2\delta_1})^{p_n} \le e^{-p_n}, \tag{2.14}$$

which, together with Lemma VII.1.2 of Feller (1968), implies uniformly in $x \in I_n$

$$Q_n (1 - \Phi(x))^{-1} \le e^{-p_n}(1 - \Phi(\rho_n b_n))^{-1} \to 0 \tag{2.15}$$

as $n \to \infty$. (1.14) may be concluded from (2.13), (2.15), and Lemma 2.1
immediately. Next, from (1.3)-(1.4) and (1.10)-(1.11) it follows for all sufficiently
large $n$ that

$$|\sigma_n\tau_n^{-1} - 1| \le \tau_n^{-2}|\sigma_n^2 - \tau_n^2| \le \max_{1 \le i \le n} |c_{ni}\bar c_n^{-1} - 1|^2 \le A_1^2 n^{-2\delta_1}. \tag{2.16}$$

By (1.12), (1.14), (2.16), and by Lemma VII.1.2 of Feller (1968), we obtain
uniformly in $x \in I_n$

$$1 - G_n(x) = 1 - F_n(\sigma_n\tau_n^{-1}x) = \bigl[1 - \Phi(\sigma_n\tau_n^{-1}x)\bigr](1 + o(1)), \tag{2.17}$$

$$1 - \Phi(\sigma_n\tau_n^{-1}x) = [1 - \Phi(x)](1 + o(1)) \tag{2.18}$$

as $n \to \infty$. Now (2.17)-(2.18) imply (1.15). This completes the proof. □
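Two steps of the proof above are stated without computation: the Markov bound behind (2.14) and the variance comparison in (2.16). The display below is our own expansion of both, using only (1.10)-(1.12), Lemma 2.2 and the definition of $\bar c_n$ as the arithmetic mean of the $c_{ni}$.

```latex
\begin{align*}
Q_n
  &\le \frac{E|S_n - T_n|^{2p_n}}{(b_n^{-1}\tau_n)^{2p_n}}
   \le \bigl(A_3\, p_n\, b_n^2\, n^{-2\delta_1}\bigr)^{p_n}
   = \bigl(e^{-1} b_n^4\, n^{-2\delta_1}\bigr)^{p_n}
   \le e^{-p_n}
  && \text{since } b_n \le n^{\delta_1/2}, \\
\sigma_n^2 - \tau_n^2
  &= \Bigl(\sum_{j=1}^{n} a_n^2(j)\Bigr)\Bigl(n^{-1}\sum_{i=1}^{n} c_{ni}^2 - \bar c_n^{\,2}\Bigr)
   = \Bigl(\sum_{j=1}^{n} a_n^2(j)\Bigr)\, n^{-1}\sum_{i=1}^{n} (c_{ni} - \bar c_n)^2, \\
\tau_n^{-2}\,|\sigma_n^2 - \tau_n^2|
  &= n^{-1}\sum_{i=1}^{n} \bigl(c_{ni}\bar c_n^{-1} - 1\bigr)^2
   \le \max_{1 \le i \le n} |c_{ni}\bar c_n^{-1} - 1|^2 .
\end{align*}
```

The first inequality in (2.16) is the elementary bound $|t - 1| \le |t^2 - 1|$ for $t = \sigma_n\tau_n^{-1} \ge 0$.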


Acknowledgment. The author wishes to thank the referee and the associate 
editor for their helpful suggestions which improved the presentation of the paper. 


REFERENCES 


DAVID, H. A. (1981). Order Statistics, 2nd ed. Wiley, New York.

FELLER, W. (1943). Generalization of a probability limit theorem of Cramér. Trans. Amer. Math.
Soc. 54 361-372.

FELLER, W. (1968). An Introduction to Probability Theory and Its Applications 1. Wiley, New York.

FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2. Wiley, New York.

HÁJEK, J. (1969). A Course in Nonparametric Statistics. Holden-Day, San Francisco.

HÁJEK, J. and ŠIDÁK, Z. (1967). Theory of Rank Tests. Academic, New York.

HUŠKOVÁ, M. (1970). Asymptotic distribution of simple linear rank statistics for testing symmetry.
Z. Wahrsch. verw. Gebiete 14 308-322.

KALLENBERG, W. C. M. (1982). Cramér type large deviations for simple linear rank statistics. Z.
Wahrsch. verw. Gebiete 60 403-409.

SEOH, M., RALESCU, S. S. and PURI, M. L. (1985). Cramér type large deviations for generalized rank
statistics. Ann. Probab. 13 115-125.


DEPARTMENT OF MATHEMATICS 
UNIVERSITY OF HOUSTON, UNIVERSITY PARK
Houston, TEXAS 77004 

